Abstract

This manuscript studies statistical properties of linear classifiers obtained through minimization
of an unregularized convex risk over a finite sample. Although the results are explicitly
finite-dimensional, inputs may be passed through feature maps; in this way, in addition to
treating the consistency of logistic regression, this analysis also handles boosting over a finite
weak learning class with, for instance, the exponential, logistic, and hinge losses. In this
finite-dimensional setting, it is still possible to fit arbitrary decision boundaries: scaling the
complexity of the weak learning class with the sample size leads to the optimal classification risk
almost surely.



1 Introduction 



Binary linear classification operates as follows: obtain a new instance, determine a set of real-valued
features, form their weighted combination, and output a label which is positive iff this combination
is nonnegative. The interpretability, empirical performance, and theoretical depth of this scheme
have all contributed to its continued popularity (Freund and Schapire, 1997; Friedman et al., 2000;
Caruana and Niculescu-Mizil, 2006).
In order to obtain the coefficients in the above weighting, convex optimization is typically
employed. Specifically, rather than just trying to pick the weighting which makes the fewest mistakes
over a finite sample (which is computationally intractable), consider instead paying attention
to the amount by which these combinations clear the zero threshold, a quantity called the margin.
Applying a convex penalty to these margins yields a convex optimization procedure, specifically one
which can be specialized into both logistic regression and AdaBoost.

Statistical analyses of this scheme predominantly follow two paths. The first path is a parameter
estimation approach; positive and negative instances are interpreted as drawn from a family of
distributions, indexed by the combination weights above, and the convex scheme is performing a
maximum likelihood search for these parameters (Friedman et al., 2000). This provides one way
to analyze logistic regression, specifically the ability of the above convex optimization to recover
these parameters; these analyses of course require such parameters to exist, and usually for the full
problem to obey certain regularity conditions (Lebanon, 2008; Gourieroux and Monfort, 1981).

The second approach is focused on the case of binary classification, with an interpretation of
the data generation process taking a background role. Indeed, in this setting, optimal parameters
may simply fail to exist (Schapire), and the convex optimization procedure can produce
unboundedly large weightings. Analyses first focused on the separable case, showing that
AdaBoost approximately maximizes normalized margins, and that this leads to good generalization
(Schapire and Freund, in preparation, Chapter 5 and the references therein). It is historically
interesting that this setting, which entails the non-existence of the best parameters, is diametrically
opposed to the parameter estimation setting above.

In order to produce a more general analysis, it was necessary to control the unbounded iterates.
This has been achieved either implicitly through regularization (Blanchard et al., 2003), or explicitly
with an early stopping rule (Bartlett and Traskin, 2007; Zhang and Yu, 2005; Schapire and Freund,
in preparation). Those analyses which handle the case of AdaBoost (cf. the work of Bartlett and Traskin
(2007) and Schapire and Freund (in preparation, Chapter 12)) are sensitive to the choice of
exponential loss, to the choice of minimization scheme, and to the choice of stopping condition.

The goal of this manuscript is to analyze the setting of minimizing an unregularized convex loss 
applied to a finite sample (i.e., just like logistic regression and AdaBoost), but for a large class of 
loss functions, and without any demands on the optimization algorithm beyond an ability to attain 
arbitrarily small error. 



1.1 Contribution 

In more detail, the primary characteristics of the presented analysis are as follows. 

Any minimization scheme. The oracle producing approximate solutions to the convex problem 
can output iterates which have any norm; they must simply be close in objective value to the 
optimum. The intent of this choice is twofold: for practitioners, it means that focusing on 
minimizing this objective value suffices; for theorists, it means that the wild deviations caused 
by these unbounded norms are not actually an issue. 

Many convex losses. The analysis applies to any convex loss which is positive at the origin, and 
zero in the limit. (Some results also require differentiability at the origin.) In particular, the 
analysis handles the popular choice of using the logistic loss, but also applies to the exponential 
and hinge losses. (For a discussion on the difficulties of generalizing from the exponential loss,
please see the work of Bartlett and Traskin (2007, Section 4).)



The main limitation of the presented analysis is that the set of features, or weak learners, must 
be finite. This weakness can be circumvented in the setting of boosting, where the complexity of 
the feature set can increase with the availability of data; it will be shown that the popular choice of
decision trees fits this regime nicely.



1.2 Outline 

A summary of the manuscript, and its organization, is as follows. Briefly, primary notation and
technical background appear in Section 2.

Section 3 presents an impossibility result, which forces the structure of subsequent content.
Specifically, with no bound on the iterates, it is in general impossible to control the deviations 
between the empirical convex risk (the convex surrogate risk over the observed finite sample), and 
the true convex risk (the convex surrogate risk over the source distribution). 

The solution is to break the input space into two pieces: a hard core, where there exists an 
imperfect yet optimal parameter vector, and the hard core's complement, where it is possible to 
have zero mistakes, albeit giving up on the existence of a minimizer to the true convex risk. This
material appears in Section 4.

The hard core has direct entailments on the structure of the convex risk. Specifically, Section 5
establishes first that the true risk has quantifiable curvature over the hard core, and effectively zero 
error over the rest of the space. Additionally, with high probability, this structure carries over to 
any sampled instance. 

The significance of first proving properties of the true risk, and then carrying them over to the
sample, is that quantities dictating the structure of the empirical convex risk are sample
independent. Consequently, finite sample guarantees, which appear in Section 6, display a number of terms
which are properties of the true convex risk, and not simply opaque random variables derived from
the sample. It is thus possible to control many such bounds together; the eventual consistency
results, appearing in Section 7, simply combine the finite sample guarantees, which all share the
same primary structural quantities, together with standard probability techniques. As discussed 
previously, in order to fit arbitrary decision boundaries, structural risk minimization is employed, 
and it is furthermore established that decision trees with a constraint on the location of splits meet 
the requisite structural risk minimization condition. 

Note that all proofs, as well as some supporting technical material, appear in a variety of
appendices.



2 Notation 

Definition 2.1. Instances $x \in \mathcal{X}$ will have associated labels $y \in \mathcal{Y} = \{-1,+1\}$. $\mu$ will always denote
a probability measure over $\mathcal{X} \times \mathcal{Y}$, with only occasional mention of the related $\sigma$-algebra. ◇

To achieve generality sufficient to treat boosting, instances will not be worked with directly, but 
instead through a family of feature maps, or weak learners. 

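Definition 2.2. Let $\mathcal{H} = \{h_i\}_{i=1}^n$ denote a finite set of (measurable) functions $\mathcal{H} \ni h : \mathcal{X} \to [-1,+1]$.
Call a pair $(\mathcal{H},\mu)$ a linear classification problem. For convenience, let $H$ denote a (bounded) linear
operator with elements of $\mathcal{H}$ as abstract columns: given any weighting $\lambda \in \mathbb{R}^n$,

$$H\lambda := \sum_{i=1}^n \lambda_i h_i.$$

For convenience, define related classes of functions

$$\mathrm{span}(\mathcal{H},b) := \{H\lambda : \lambda \in \mathbb{R}^n, \|\lambda\|_1 \le b\},$$

$$\mathrm{span}(\mathcal{H}) := \bigcup_{b=1}^{\infty} \mathrm{span}(\mathcal{H},b) = \{H\lambda : \lambda \in \mathbb{R}^n\}.$$ ◇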

The class $\mathrm{span}(\mathcal{H})$ will be the search space for linear classification; if for instance $\mathcal{H}$ consists of
projection maps, then this is the standard setting of linear regression; in general, however, it can be
viewed as a boosting problem. That the range of the function family is fixed specifically to $[-1,+1]$
is irrelevant; compactness of this output space, however, is used throughout.

Definition 2.3. $\Phi$ contains all convex losses $\phi$ which are positive at the origin, and satisfy
$\lim_{z\to-\infty} \phi(z) = 0$. ◇

This manuscript makes the choice of writing losses as nondecreasing functions; in this notation,
three examples are the exponential loss $\exp(z)$, logistic loss $\ln(1+\exp(z))$, and hinge loss
$\max\{0, 1+z\}$. Some of the consistency results will also require the loss to be differentiable at the origin; this
requirement, which is satisfied by the three preceding examples, will be explicitly stated.
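As a concrete illustration of the three example losses in the nondecreasing convention, the following minimal Python sketch implements them and spot-checks the conditions of Definition 2.3 on a finite grid. The helper `spot_check`, its grid, and its tolerances are assumptions made for this illustration; a grid check is a sanity test, not a proof of membership in $\Phi$.

```python
import math

def exp_loss(z: float) -> float:
    """Exponential loss exp(z)."""
    return math.exp(z)

def logistic_loss(z: float) -> float:
    """Logistic loss ln(1 + exp(z)), arranged to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def hinge_loss(z: float) -> float:
    """Hinge loss max{0, 1 + z}."""
    return max(0.0, 1.0 + z)

def spot_check(phi, lo: float = -40.0, hi: float = 10.0, steps: int = 200) -> bool:
    """Grid check (hypothetical helper): positive at the origin, nondecreasing,
    midpoint convex on an evenly spaced grid, and nearly zero far to the left."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    values = [phi(z) for z in grid]
    positive_at_origin = phi(0.0) > 0.0
    nondecreasing = all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
    midpoint_convex = all(
        values[k] <= 0.5 * (values[k - 1] + values[k + 1]) + 1e-9 * (1.0 + abs(values[k]))
        for k in range(1, steps)
    )
    vanishing_left = phi(lo) < 1e-12
    return positive_at_origin and nondecreasing and midpoint_convex and vanishing_left
```

A function such as $z \mapsto z^2$ fails the check, since it is neither nondecreasing nor vanishing as $z \to -\infty$.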

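Definition 2.4. Given a probability measure $\mu$, a loss $\phi \in \Phi$, a function class $\mathcal{F}$, and arbitrary
element $f \in \mathcal{F}$, the corresponding risk functional, and optimal risk, are

$$\mathcal{R}_\phi(f) := \int \phi(-yf(x))\,d\mu(x,y), \qquad \mathcal{R}_\phi(\mathcal{F}) := \inf_{f\in\mathcal{F}} \mathcal{R}_\phi(f).$$

When a sample $S := \{(x_i,y_i)\}_{i=1}^m$ is provided, let $\mathcal{R}^m_\phi$ denote the corresponding empirical risk,
meaning the convex risk corresponding to the empirical measure $\mu_m(C) := m^{-1}\sum_{i=1}^m 1((x_i,y_i) \in C)$,
thus $\mathcal{R}^m_\phi(f) = m^{-1}\sum_i \phi(-y_i f(x_i))$. Lastly, let $\mathcal{L}$ denote the classification risk $\mathcal{L}(y,y') := 1(y \ne y')$,
and overload the notation for risks so that

$$\mathcal{R}_\mathcal{L}(f) := \int \mathcal{L}(y, 2\cdot 1(f(x) \ge 0) - 1)\,d\mu(x,y), \qquad \mathcal{R}_\mathcal{L}(\mathcal{F}) := \inf_{f\in\mathcal{F}} \mathcal{R}_\mathcal{L}(f).$$ ◇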
Typically, some function class $\mathcal{H}$, a particular weighting $\lambda \in \mathbb{R}^n$, and perhaps a sample of size $m$
will be available, and example relevant risks are $\mathcal{R}_\phi(H\lambda)$, $\mathcal{R}^m_\phi(H\lambda)$, $\mathcal{R}_\phi(\mathrm{span}(\mathcal{H}))$.
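To make the empirical quantities concrete, here is a small Python sketch computing the empirical convex risk and the empirical classification risk of a predictor $H\lambda$ on a finite sample. The function names, the toy sample, and the choice of two projection maps as weak learners are assumptions for illustration only. Labels are predicted as $+1$ exactly when the combination is nonnegative, matching the convention above.

```python
def hinge_loss(z: float) -> float:
    """Hinge loss in the nondecreasing convention: max{0, 1 + z}."""
    return max(0.0, 1.0 + z)

def predictor(weights, weak_learners):
    """Return the function x -> sum_i weights[i] * weak_learners[i](x)."""
    return lambda x: sum(w * h(x) for w, h in zip(weights, weak_learners))

def empirical_convex_risk(phi, f, sample) -> float:
    """Average of phi(-y * f(x)) over the sample."""
    return sum(phi(-y * f(x)) for x, y in sample) / len(sample)

def empirical_classification_risk(f, sample) -> float:
    """Fraction of sample points whose label differs from 2 * 1(f(x) >= 0) - 1."""
    mistakes = sum(1 for x, y in sample if (1 if f(x) >= 0 else -1) != y)
    return mistakes / len(sample)

# Hypothetical toy problem: two projection maps on the square [-1, +1]^2.
projections = [lambda x: x[0], lambda x: x[1]]
toy_sample = [((1.0, 0.0), +1), ((0.0, 1.0), -1), ((0.5, 0.5), +1)]
f = predictor([1.0, -1.0], projections)
```

In this toy sample the third point sits exactly on the decision boundary: it is classified correctly (nonnegative combinations map to $+1$), yet it still pays the hinge penalty $\phi(0) = 1$, so the convex risk is positive while the classification risk is zero.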

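Definition 2.5. The requirement placed on the minimization oracle is that, for any $\phi \in \Phi$,
finite sample of size $m$, and suboptimality $\rho > 0$, the oracle can produce $\lambda \in \mathbb{R}^n$ with
$\mathcal{R}^m_\phi(H\lambda) \le \mathcal{R}^m_\phi(\mathrm{span}(\mathcal{H})) + \rho$. ◇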

The theorems themselves will avoid any reliance on this oracle, and their guarantees will hold
with any $\rho$-suboptimal $\lambda$ as input; this manuscript is concerned with statistical properties of these
predictors. However, note briefly that for many losses of interest, in particular the hinge, logistic,
and exponential losses, oracles satisfying the above guarantee exist.



Proposition 2.6 (Nesterov, 2003; Telgarsky, 2012). Let a linear classification problem $(\mathcal{H},\mu)$,
finite sample of size $m$, and suboptimality $\rho > 0$ be given. Suppose:

1. Either $\phi$ is Lipschitz continuous, attains its infimum, and subgradient descent is employed;

2. Or $\phi$ is in the convex cone generated by the logistic and exponential losses, and coordinate
descent is employed (as in AdaBoost);

then $\mathrm{poly}(1/\rho)$ iterations suffice to produce a $\rho$-suboptimal iterate $\lambda \in \mathbb{R}^n$.

(The proof, in Appendix 0, is mostly a reduction to known results regarding subgradient and 
coordinate descent.) 
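The first case of the proposition can be sketched in a few lines of Python: subgradient descent on the empirical hinge risk (the hinge loss is Lipschitz and attains its infimum). This is only an illustrative sketch under stated assumptions, not the procedure analyzed in the cited works; the step size $1/\sqrt{t}$, the iteration count, the best-iterate bookkeeping, and the toy feature matrix are all choices made here.

```python
import math

def hinge_loss(z: float) -> float:
    return max(0.0, 1.0 + z)

def empirical_risk(weights, features, labels) -> float:
    """Empirical hinge risk; features[i][j] holds h_j(x_i)."""
    total = 0.0
    for row, y in zip(features, labels):
        total += hinge_loss(-y * sum(w * h for w, h in zip(weights, row)))
    return total / len(labels)

def subgradient(weights, features, labels):
    """One subgradient of the empirical hinge risk at `weights`."""
    grad = [0.0] * len(weights)
    for row, y in zip(features, labels):
        margin = y * sum(w * h for w, h in zip(weights, row))
        if 1.0 - margin > 0.0:  # the hinge is active at this example
            for j, h in enumerate(row):
                grad[j] -= y * h / len(labels)
    return grad

def subgradient_descent(features, labels, iterations: int = 50):
    """Run subgradient descent from 0 and return the best iterate seen."""
    weights = [0.0] * len(features[0])
    best_weights, best_risk = list(weights), empirical_risk(weights, features, labels)
    for t in range(1, iterations + 1):
        step = 1.0 / math.sqrt(t)  # assumed step-size schedule
        grad = subgradient(weights, features, labels)
        weights = [w - step * g for w, g in zip(weights, grad)]
        risk = empirical_risk(weights, features, labels)
        if risk < best_risk:
            best_weights, best_risk = list(weights), risk
    return best_weights, best_risk

# Hypothetical separable sample with two projection-map features.
toy_features = [[1.0, 0.0], [0.5, -0.5], [-1.0, 0.0], [-0.5, 0.5]]
toy_labels = [+1, +1, -1, -1]
```

Note that nothing in the routine constrains the norm of the returned weights; it is judged purely by objective value, which is the only demand Definition 2.5 places on an oracle.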

Lastly, this manuscript adopts a form of event-defining notation common in probability theory. 

Definition 2.7. Given a function $f : A \to B$ and binary relation $\sim$, define $[f \sim b] := \{a \in A : f(a) \sim b\}$;
for example $[f > 0] = \{a \in A : f(a) > 0\} = f^{-1}((0,\infty))$. At times, the variables will
also be provided, for instance $[bf(a) > 0] = \{(a,b) \in A \times B : bf(a) > 0\}$. ◇

3 An impossibility result 

The stated goal of allowing iterates to have unbounded norms is at odds with the task of bounding
the convex risk $\mathcal{R}_\phi$.

Proposition 3.1. There exists a linear classification problem $(\mathcal{H},\mu)$ with the following
characteristics.

1. $\mathcal{X}$ is the square $[-1,+1]^2$, and $\mathcal{H}$ consists of the two projection maps.

2. $\mu$ has countable support.

3. There exists a perfect separator, albeit with zero margin.

4. For any $\phi \in \Phi$, $\mathcal{R}_\phi(\mathrm{span}(\mathcal{H})) = 0$.

5. Let any finite sample $\{(x_i,y_i)\}_{i=1}^m$, any $b > 0$, and any $\phi \in \Phi$ be given. Then there exists a
maximum margin solution $\bar\lambda$, i.e., a solution satisfying

$$\min_{i\in[m]} \frac{y_i(H\bar\lambda)(x_i)}{\|\bar\lambda\|_1} = \sup\left\{ \min_{i\in[m]} y_i(H\lambda)(x_i) : \lambda \in \mathbb{R}^n, \|\lambda\|_1 = 1 \right\},$$

which has $\mathcal{R}_\phi(H\bar\lambda) > b$.
Figure 1: A bad example for unconstrained linear classification; please see Proposition 3.1.



A full proof is provided in the appendices, but the mechanism is simple enough to appear as a picture.
Consider the linear classification problem in Figure 1, which has positive ("+") and negative ("-")
examples along two lines. Optimal solutions to $\mathcal{R}_\mathcal{L}$ are of the form $c\lambda$, where $\lambda = (-1,+1)$ and $c > 0$
(note $\lim_{c\uparrow\infty} \mathcal{R}_\phi(H(c\lambda)) = \mathcal{R}_\phi(\mathrm{span}(\mathcal{H})) = 0$). Unfortunately, the positive and negative examples are
staggered; as a result, for any sample, every max margin predictor $\bar\lambda$, which is determined solely by
the rightmost "+" and uppermost "-", will fail to agree with the optimal predictor on some small
region. A positive probability mass of points falls within this region, and so, by considering scalings
$c\bar\lambda$ as $c \uparrow \infty$, the convex risk $\mathcal{R}_\phi$ may be made arbitrarily large.
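The scaling mechanism can be illustrated numerically. The sketch below is a simplified stand-in, not the construction of Proposition 3.1: it assumes a fixed direction that is correct with margin `margin_right` on most of the mass and wrong with margin `margin_wrong` on a region of mass `mass_wrong` (all three numbers are hypothetical), and it uses the logistic loss. Scaling the direction by $c$ leaves the classification risk untouched while the convex risk eventually grows without bound.

```python
import math

def logistic_loss(z: float) -> float:
    """Logistic loss ln(1 + exp(z)), arranged to avoid overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def scaled_risks(c: float, mass_wrong: float = 0.01,
                 margin_right: float = 0.5, margin_wrong: float = 0.1):
    """Convex and classification risk of a direction scaled by c > 0.

    Assumed two-point caricature: with mass 1 - mass_wrong the unscaled margin
    is +margin_right, and with mass mass_wrong it is -margin_wrong."""
    convex = ((1.0 - mass_wrong) * logistic_loss(-c * margin_right)
              + mass_wrong * logistic_loss(c * margin_wrong))
    classification = mass_wrong  # scaling never changes the predicted signs
    return convex, classification

if __name__ == "__main__":
    for c in (1.0, 10.0, 100.0, 1000.0, 10000.0):
        print(c, scaled_risks(c))
```

The convex risk first decreases as the correctly classified mass is pushed far from the boundary, and then increases roughly linearly in $c$ once the penalty on the misclassified region dominates.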

The statement of Proposition 3.1 is encumbered with details in order to convey the message
that not only do such examples exist, they are fairly benign; indeed, the example depends on the
additional regularity of large margin solutions. The only difficulty is the lack of any norm constraint
on permissible iterates.

On the other hand, notice that the classification risk $\mathcal{R}_\mathcal{L}$ is not only small, but its empirical
counterpart $\mathcal{R}^m_\mathcal{L}$ provides a reasonable estimate as $m$ increases. Furthermore, if the distribution
were adjusted slightly so that every $\lambda \in \mathbb{R}^n$ made some mistake, then these unbounded iterates
would fail to exist: the huge penalty for predictions very far from correct would constrain the norms
of all good predictors.

The preceding paragraph describes the exact strategy of the remainder of the manuscript: linear 
classification problems are split into two pieces, one where optimization may produce unboundedly
large iterates with small classification risk, and another piece where iterates are bounded thanks to 
the presence of difficult examples. 



4 Hard cores 



One way to split a linear classification problem into two pieces, one bounded and one unbounded, is
to identify a hard core of very difficult instances. (Note, forms of the hard core have been previously
used to study linear classification (Impagliazzo, 1995; Mukherjee et al., 2011; Telgarsky, 2012).)

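Definition 4.1. Given a linear classification problem $(\mathcal{H},\mu)$, let $\mathcal{D}(\mathcal{H},\mu)$ denote reweightings of $\mu$
which decorrelate every regressor $H\lambda$; that is,

$$\mathcal{D}(\mathcal{H},\mu) := \left\{ p \in L^1(\mu) : p \ge 0,\ \forall \lambda \in \mathbb{R}^n \centerdot \int y(H\lambda)(x)p(x,y)\,d\mu(x,y) = 0 \right\}.$$

Correspondingly, $\mathcal{S}_\mathcal{D}(\mathcal{H},\mu)$ tracks the supports of these weightings:

$$\mathcal{S}_\mathcal{D}(\mathcal{H},\mu) := \{ [p > 0] : p \in \mathcal{D}(\mathcal{H},\mu) \}.$$

A hard core $\mathscr{C} \subseteq \mathcal{X} \times \mathcal{Y}$ for $(\mathcal{H},\mu)$ is a maximal element of $\mathcal{S}_\mathcal{D}(\mathcal{H},\mu)$; that is,

$$\mathscr{C} \in \mathcal{S}_\mathcal{D}(\mathcal{H},\mu) \quad\text{and}\quad \forall C \in \mathcal{S}_\mathcal{D}(\mathcal{H},\mu) \centerdot \mu(\mathscr{C} \setminus C) \ge 0 \text{ and } \mu(C \setminus \mathscr{C}) = 0.$$

("Maximal", in the presence of measures, will always mean up to sets of measure zero.) ◇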

Momentarily it will be established that hard cores split problems in the desired way; but first, 
note that hard cores actually exist. 

Theorem 4.2. Every linear classification problem $(\mathcal{H},\mu)$ has a hard core.

To prove this, first observe that $\mathcal{S}_\mathcal{D}(\mathcal{H},\mu)$ is nonempty: it always contains $\emptyset$, with corresponding
reweighting $p(x,y) = 0$. In order to produce a hard core, it does not suffice to simply union the
contents of $\mathcal{S}_\mathcal{D}(\mathcal{H},\mu)$, since the resulting set may fail to be measurable, and it is entirely unclear if a
corresponding $p \in \mathcal{D}(\mathcal{H},\mu)$ can be found. Instead, the full proof in the appendices constructs the hard
core via an optimization, and the observation that $\mathcal{S}_\mathcal{D}(\mathcal{H},\mu)$ is closed under countable unions.

With the basic sanity check of existence out of the way, notice that hard cores achieve the goal
laid out at the closing of Section 3. The proof, which is somewhat involved, appears in the appendices.

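Theorem 4.3. Let problem $(\mathcal{H},\mu)$ and hard core $\mathscr{C}$ be given. The following statements hold.

1. There exists a sequence $\{\lambda_i\}_{i=1}^\infty$ with $y(H\lambda_i)(x) = 0$ for $\mu$-a.e. $(x,y) \in \mathscr{C}$, and $y'(H\lambda_i)(x') \uparrow \infty$
for $\mu$-a.e. $(x',y') \in \mathscr{C}^c$.

2. Every $\lambda \in \mathbb{R}^n$ satisfies either $\mu(\mathscr{C} \cap [y(H\lambda)(x) = 0]) = \mu(\mathscr{C})$ or $\mu(\mathscr{C} \cap [y(H\lambda)(x) < 0]) > 0$.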

The first property provides the existence of a sequence which is not only very good $\mu$-a.e. over
$\mathscr{C}^c$, but furthermore does not impact the value of $H\lambda$ over $\mathscr{C}$; that is to say, this sequence can
grow unboundedly, and have unboundedly positive margins over $\mathscr{C}^c$, while optimization over $\mathscr{C}$
can effectively proceed independently. On the other hand, $\mathscr{C}$ is difficult: every predictor is either
abstaining $\mu$-a.e., or makes errors on a set of positive measure.

Finally, corresponding to the hard core, it is useful to specialize the definition of risk to consider 
regions. 

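Definition 4.4. Given a set $C$ (typically $\mathscr{C}$ or $\mathscr{C}^c$), loss $\phi$, function class $\mathcal{F}$, and any $f \in \mathcal{F}$, define

$$\mathcal{R}_{\phi;C}(f) := \int \phi(-yf(x))1((x,y) \in C)\,d\mu(x,y), \qquad \mathcal{R}_{\phi;C}(\mathcal{F}) := \inf_{f\in\mathcal{F}} \mathcal{R}_{\phi;C}(f),$$

with analogous definitions for $\mathcal{R}^m_{\phi;C}$, $\mathcal{R}_{\mathcal{L};C}$, etc. ◇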



5 Hard cores and convex risk 



The hard core imposes the following structure on $\mathcal{R}_\phi$. As provided by Theorem 4.3, there is a
sequence which does arbitrarily well over $\mathscr{C}^c$ without impacting predictions over $\mathscr{C}$. On the other
hand, since mistakes must occur over $\mathscr{C}$, convex losses within $\Phi$ will be forced to avoid large
predictors.

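Theorem 5.1. Let problem $(\mathcal{H},\mu)$, hard core $\mathscr{C}$, and loss $\phi \in \Phi$ be given.

1. There exists a sequence $\{\lambda_i\}_{i=1}^\infty$ with $y(H\lambda_i)(x) = 0$ for $\mu$-a.e. $(x,y) \in \mathscr{C}$, and
$\lim_{i\to\infty} \phi(-y'(H\lambda_i)(x')) = 0$ for $\mu$-a.e. $(x',y') \in \mathscr{C}^c$.

2. Let any $\rho > 0$ be given. Then there exists $c_\rho \in \mathbb{R}$ and a set $N_\rho$ with $\mu(N_\rho) = 0$ so that for
every $\lambda \in \mathbb{R}^n$ with $\mathcal{R}_{\phi;\mathscr{C}}(H\lambda) \le \mathcal{R}_{\phi;\mathscr{C}}(\mathrm{span}(\mathcal{H})) + \rho$, there exists a representation $\lambda' \in \mathbb{R}^n$ with
$H\lambda = H\lambda'$ over $\mathscr{C} \setminus N_\rho$, and $\|\lambda'\|_1 \le c_\rho$.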

The structural properties of the true convex risk transfer over, with high probability, to any 
sampled problem. Crucially, the various bounds are quantified outside the probability; that is to 
say, they do not depend on the sample. 
Theorem 5.2. Let problem $(\mathcal{H},\mu)$, hard core $\mathscr{C}$, and loss $\phi \in \Phi$ be given.

1. With probability 1 over the draw of a finite sample, there exists $\lambda \in \mathbb{R}^n$ so that every $(x_i,y_i) \in \mathscr{C}^c$
satisfies $y_i(H\lambda)(x_i) > 0$, and every $(x'_i,y'_i) \in \mathscr{C}$ satisfies $y'_i(H\lambda)(x'_i) = 0$.

2. Given any empirical suboptimality $\rho > 0$, there exist $c > 0$ and $b > 0$ so that for any $\delta > 0$,
with probability at least $1 - \delta$ over a draw of $m$ points, if $m_\mathscr{C}$, the number of points landing
in $\mathscr{C}$, has bound

$$m_\mathscr{C} \ge c^2(\ln(n) + \ln(1/\delta)),$$

then every $\rho$-suboptimal $\lambda \in \mathbb{R}^n$ over the sample restricted to $\mathscr{C}$, meaning

$$\mathcal{R}^m_{\phi;\mathscr{C}}(H\lambda) \le \mathcal{R}^m_{\phi;\mathscr{C}}(\mathrm{span}(\mathcal{H})) + \rho,$$

has a representation $\lambda'$ with $\|\lambda'\|_1 \le b$ which has $H\lambda = H\lambda'$ over the sample restricted to $\mathscr{C}$,
and in general $\mu$-a.e. over $\mathscr{C}$.



6 Deviation inequalities 



With the structure of the convex risk in place, the stage is set to establish deviation inequalities.
These will be stated in terms of both the convex risk $\mathcal{R}_\phi$ and the classification risk $\mathcal{R}_\mathcal{L}$. In order
to make this correspondence, this manuscript relies on standard techniques due to Zhang (2004) and
Bartlett, Jordan, and McAuliffe (2006).



Definition 6.1. Let $\mathscr{F}$ denote the set of measurable functions over $\mathcal{X}$. ◇



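Proposition 6.2 (Bartlett et al., 2006). Let any $\phi \in \Phi$ be given with $\phi$ differentiable at 0. There
exists an associated function $\psi : [0,1] \to [0,\infty)$ with the following properties. First, for any
probability measure $\mu$ and any $f : \mathcal{X} \to \mathbb{R}$, $\psi(\mathcal{R}_\mathcal{L}(f) - \mathcal{R}_\mathcal{L}(\mathscr{F})) \le \mathcal{R}_\phi(f) - \mathcal{R}_\phi(\mathscr{F})$. Second, the inverse
$\psi^{-1}$ exists over $[0,\infty)$, and satisfies $\psi^{-1}(r) \downarrow 0$ as $r \downarrow 0$. ◇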



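Definition 6.3. Given $\phi \in \Phi$, let $\psi$, called the $\psi$-transform, be as in Proposition 6.2. ◇

The general use of $\psi$ is through its inverse, which provides

$$\mathcal{R}_\mathcal{L}(H\lambda) - \mathcal{R}_\mathcal{L}(\mathscr{F}) \le \psi^{-1}\left(\mathcal{R}_\phi(H\lambda) - \mathcal{R}_\phi(\mathscr{F})\right)
= \psi^{-1}\left(\mathcal{R}_\phi(H\lambda) - \mathcal{R}_\phi(\mathrm{span}(\mathcal{H})) + \mathcal{R}_\phi(\mathrm{span}(\mathcal{H})) - \mathcal{R}_\phi(\mathscr{F})\right).$$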

Although $\psi^{-1}$ may be unwieldy, it is frequently easy to provide a useful upper bound. For instance,
the exponential loss has $\psi^{-1}(r) \le 2\sqrt{r}$, the logistic loss has $\psi^{-1}(r) \le 4\sqrt{r}$, and the hinge loss has
$\psi^{-1}(r) \le r$ (Zhang, 2004; Bartlett et al., 2006).



Theorem 6.4. Let $(\mathcal{H},\mu)$, $\mathscr{C}$, and $\phi \in \Phi$ be given. Let a suboptimality tolerance $\rho > 0$ be given;
results will depend on reals $c > 0$ and $b > 0$ determined by the preceding terms. The following
statements simultaneously hold with any probability $1 - \delta$ over the draw of $m$ samples (with $\delta'$ denoting
$\delta$ divided by a fixed constant, for convenience), and any weighting $\lambda \in \mathbb{R}^n$ which is $\epsilon$-suboptimal (with
$\epsilon \le \rho$) for the corresponding surrogate empirical risk problem, meaning $\mathcal{R}^m_\phi(H\lambda) \le \mathcal{R}^m_\phi(\mathrm{span}(\mathcal{H})) + \epsilon$.

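1. Let $m_\mathscr{C}$ and $m_+$ respectively denote the number of samples falling into $\mathscr{C}$ and $\mathscr{C}^c$. Then

$$m_\mathscr{C} \ge m\left(\mu(\mathscr{C}) - \sqrt{\ln(1/\delta')/(2m)}\right),$$
$$m_+ \ge m\left(\mu(\mathscr{C}^c) - \sqrt{\ln(1/\delta')/(2m)}\right).$$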
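2. The true classification risk over the unbounded portion, $\mathscr{C}^c$, has bound

$$\mathcal{R}_{\mathcal{L};\mathscr{C}^c}(H\lambda) \le \frac{\epsilon}{\phi(0)} + 2\sqrt{\frac{2\epsilon\left(n\ln(2m_+ + 1) + \ln(4/\delta')\right)}{\phi(0)m_+}} + \frac{4\left(n\ln(2m_+ + 1) + \ln(4/\delta')\right)}{m_+}. \tag{6.5}$$

If moreover $\epsilon < \phi(0)/m$, then

$$\mathcal{R}_{\mathcal{L};\mathscr{C}^c}(H\lambda) \le \frac{4\left(n\ln(2m_+ + 1) + \ln(4/\delta')\right)}{m_+}. \tag{6.6}$$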

3. Suppose

$$m_\mathscr{C} \ge c^2(\ln(n) + \ln(6/\delta')).$$

The true surrogate risk over the bounded portion, $\mathscr{C}$, has bound



c 



7^0;^(i^A) - 7^0;^(span(■H)) < e + ^ — (6.7) 

Additionally, if $\phi$ is differentiable at 0, the classification risk has bound



+ 7^0;<g■(span(H))-7^0;«=(;?)j. (6.8) 

4. Suppose, for simplicity, that

$$m \ge \max\left\{ 2\ln(1/\delta')/\min\{\mu(\mathscr{C})^2, \mu(\mathscr{C}^c)^2\},\ 2c^2(\ln(n) + \ln(1/\delta'))/\mu(\mathscr{C}) \right\}$$

(where bounds are interpreted to hold trivially when denominators contain 0) and additionally
that $\epsilon < \phi(0)/m$ and $\phi$ is differentiable at 0. Then the true classification risk of the full problem
has bound



nc{H\) ~ TZciS) < e + 



+ 7?.^;.^(span('H)) - 'R^-<^{^) 
8(nln(TO^('^'=) + 1) + ln(4/(5') 



7 Consistency 

In order for the predictors to converge to the best choice, near-optimal choices must be available.
Correspondingly, the first consistency result makes a strong assumption about the function class,
albeit one which may be found in many treatments of the consistency of boosting (cf. the work of
Bartlett and Traskin (2007) and Schapire and Freund (in preparation, Chapter 12)).
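Theorem 7.1. Let $(\mathcal{H},\mu)$ and $\phi \in \Phi$ be given with $\phi$ differentiable at 0. Suppose $\mathcal{R}_\phi(\mathrm{span}(\mathcal{H})) = \mathcal{R}_\phi(\mathscr{F})$.
Then there exists a sequence of sample sizes $\{m_i\}_{i=1}^\infty \uparrow \infty$, and empirical suboptimality
tolerances $\{\epsilon_i\}_{i=1}^\infty \downarrow 0$, so that every sequence of $\epsilon_i$-suboptimal weightings $\{\lambda_i\}_{i=1}^\infty$ (i.e.,
$\mathcal{R}^{m_i}_\phi(H\lambda_i) \le \epsilon_i + \mathcal{R}^{m_i}_\phi(\mathrm{span}(\mathcal{H}))$) satisfies $\mathcal{R}_\mathcal{L}(H\lambda_i) \to \mathcal{R}_\mathcal{L}(\mathscr{F})$ almost surely.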

This additional assumption is hard to justify in the presence of only finitely many hypotheses.
To mitigate this, this manuscript follows an approach remarked upon by Schapire and Freund
(in preparation, Chapter 12): to consider an increasing sequence of classes which asymptotically
grant the desired expressiveness property.

Definition 7.2. Let a probability measure $\mu$ be given. A family of finite hypothesis classes $\{\mathcal{H}_i\}_{i=1}^\infty$
is called a linear structural risk minimization family for $\mu$, or simply L-SRM family, if for any
$\phi \in \Phi$ and tolerance $\epsilon > 0$, there exists $j$ so that $\mathcal{R}_\phi(\mathrm{span}(\mathcal{H}_j)) \le \mathcal{R}_\phi(\mathscr{F}) + \epsilon$. ◇

The significance of this definition will be clear momentarily, as it grants a stronger consistency 
result. But first notice that straightforward classes satisfy the L-SRM condition. 

Proposition 7.3. Suppose $\mathcal{X} = \mathbb{R}^d$, and let a probability measure $\mu$ be given where $\mu_\mathcal{X}$, the marginal
over $\mathcal{X}$, is a Borel probability measure. Let $\mathcal{H}_i$ denote the collection of decision trees with axis aligned
splits with thresholds taken from $\{-i, -i+1/i, \ldots, i-1/i, i\}$. Then $\{\mathcal{H}_i\}_{i=1}^\infty$ is an L-SRM family.
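The threshold sets in this proposition can be enumerated directly. The sketch below assumes the grid reads as $\{-i, -i+1/i, \ldots, i-1/i, i\}$, i.e., spacing $1/i$ over $[-i, +i]$; the function names are hypothetical, and `stump` shows only the simplest (single split) tree built from such a threshold. Exact rationals are used so that grid membership can be tested without rounding error.

```python
from fractions import Fraction

def threshold_grid(i: int):
    """Thresholds -i, -i + 1/i, ..., i - 1/i, i (assumed reading of the grid)."""
    if i < 1:
        raise ValueError("i must be a positive integer")
    return [Fraction(-i) + Fraction(k, i) for k in range(2 * i * i + 1)]

def stump(coordinate: int, threshold):
    """Single axis-aligned split: +1 if x[coordinate] >= threshold, else -1."""
    return lambda x: 1 if x[coordinate] >= threshold else -1

def grid_resolution(i: int):
    """Spacing and half-width of the i-th grid, as exact rationals."""
    return Fraction(1, i), Fraction(i)
```

As $i$ grows the grid both widens (covering $[-i, +i]$) and refines (spacing $1/i$), which is the sense in which the complexity of the weak learning class scales up.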



Proving this fact, as with many classical universal approximation theorems (Kolmogorov;
Cybenko, 1989), relies on basic properties of continuous functions over compact sets. In order to
reduce to this scenario from the general scenario of measurable functions $\mathscr{F}$, Lusin's Theorem is
employed, just as with similar results due to Zhang (2004, Section 4).



Now that the existence of reasonable L-SRM families is established, note the corresponding 
consistency result. 

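Theorem 7.4. Let probability measure $\mu$ and loss $\phi \in \Phi$ be given with $\phi$ differentiable at 0, as well
as an L-SRM family $\{\mathcal{H}_i\}_{i=1}^\infty$ for $\mu$. Then there exists a sequence of sample sizes $\{m_i\}_{i=1}^\infty$, a subsequence
of classes $\{\mathcal{H}_{j_i}\}_{i=1}^\infty$, and suboptimalities $\{\epsilon_i\}_{i=1}^\infty$, so that every sequence of regressors $\{H_{j_i}\lambda_i\}_{i=1}^\infty$
$\epsilon_i$-suboptimal for the corresponding empirical problem satisfies $\mathcal{R}_\mathcal{L}(H_{j_i}\lambda_i) \to \mathcal{R}_\mathcal{L}(\mathscr{F})$ almost surely.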

This manuscript is basically saying that constraining learning at the level of the weak learning 
oracle is sufficient for consistency. Of course, it could be argued that it is more elegant to instead 
apply a regularizer to the objective function (with data-dependent parameter choice), and permit 
a powerful weak learning class of infinite size. But such a discussion is beyond the scope of this 
manuscript. 
A Technical Preliminaries



Lemma A.1. Let any $\phi \in \Phi$ be given. Then $\phi$ is continuous, measurable, and nondecreasing.
Subgradients exist everywhere, and satisfy $\partial\phi(0) \subseteq \mathbb{R}_{++}$. Lastly, the conjugate $\phi^*$ satisfies
$\mathrm{dom}(\phi^*) \subseteq \mathbb{R}_+$ and $\phi^*(0) = 0$.



Proof. Since $\phi$ is finite everywhere, it is continuous (Rockafellar, 1970, Corollary 10.1.1), and thus
measurable (Folland, 1999, Corollary 2.2). Since convex functions are subdifferentiable everywhere
along the relative interior of their domains (which in this case is just $\mathbb{R}$), it follows that $\phi$ has
subgradients everywhere (Rockafellar, 1970, Theorem 23.4).



If $\phi$ were not nondecreasing, there would exist $x < y$ with $\phi(x) > \phi(y)$; but that means every
subgradient $g \in \partial\phi(x)$ satisfies

$$\phi(y) \ge \phi(x) + g(y - x),$$

and thus $g < 0$. But then, for any $z < x$, $\phi(z) \ge \phi(x) + g(z - x)$, which in particular contradicts
$\lim_{z\to-\infty} \phi(z) = 0$ (indeed, it implies $\lim_{z\to-\infty} \phi(z) = \infty$); thus $\phi$ is nondecreasing.

Next, since $\phi$ is nondecreasing, $\partial\phi \subseteq \mathbb{R}_+$. However, since $\phi(0) > 0$, it follows that $\partial\phi(0) \subseteq \mathbb{R}_{++}$,
since otherwise $\lim_{z\to-\infty} \phi(z) = 0$ would be contradicted.

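Turning to $\phi^*$, first note

$$\phi^*(0) = \sup_z\, 0 \cdot z - \phi(z) = 0.$$

Lastly, since $\phi$ is nondecreasing, then for any $g < 0$,

$$\phi^*(g) = \sup_z\, gz - \phi(z) \ge \sup_{z<0}\, gz - \phi(0) = \infty.$$

That is to say, $\mathrm{dom}(\phi^*) \subseteq \mathbb{R}_+$. □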

Proposition A.2. Let a linear classification problem $(\mathcal{H},\nu)$ and loss $\phi \in \Phi$ be given. Then given a
bound $b$ on the norm of considered predictors, there exists $c > 0$ so that, for any $\delta > 0$, with
probability at least $1 - \delta$ over the draw of $m$ points from $\nu$, every $\lambda \in \mathbb{R}^n$ with $\|\lambda\|_1 \le b$ satisfies



Proof. Let bound $b$ and loss $\phi \in \Phi$ be given. Define a truncation

$$\tilde\phi(z) := \begin{cases} \phi(z) & \text{when } z \le b, \\ \phi(b) & \text{otherwise.} \end{cases}$$

Since $\phi$ is nondecreasing (cf. Lemma A.1), $\tilde\phi(z) \le \phi(b)$, and furthermore $\tilde\phi$ is Lipschitz with a
constant that may be measured at $b$; indeed, since $\phi$ is finite everywhere, it has bounded subdifferential
sets (Rockafellar, 1970, Theorem 23.4), and thus, taking any $z_1, z_2 \in \mathbb{R}$ and supposing without loss
of generality that $z_1 \le z_2$,

$$|\phi(z_2) - \phi(z_1)| = \phi(z_2) - \phi(z_1)
\le \sup\{\phi(z_2) - (\phi(z_2) + \langle g_2, z_1 - z_2\rangle) : g_2 \in \partial\phi(z_2)\}
= |z_2 - z_1| \sup\{|g_2| : g_2 \in \partial\phi(z_2)\}
< \infty;$$

correspondingly, set a Lipschitz constant $L_b := \sup\{|g| : g \in \partial\phi(b)\}$.

Note that for every $f \in \mathrm{span}(\mathcal{H},b)$, $\sup_{x\in\mathcal{X}} |f(x)| \le b$, and thus $\mathcal{R}_{\tilde\phi}(f) = \mathcal{R}_\phi(f)$. Lastly, the
desired constant $c$, which does not depend on $\delta$, $n$, or $m$, will be $c := \max\{2L_b b\sqrt{2}, \phi(b)\}$.
Now let a sample of size $m$ be given, and let $\mathfrak{R}_m(\mathrm{span}(\mathcal{H},b))$ denote the Rademacher
complexity of $\mathrm{span}(\mathcal{H},b)$. By properties of Rademacher complexity and a few appeals to McDiarmid's
inequality (Boucheron et al., 2005, Theorem 3.1, and the proof of Theorem 4.1), with probability
at least $1 - \delta$ over the draw of this sample,



sup |7^^(i^A) -7^^(i^A)| = sup n^{HX) -nj{HX) 



||A||i<6 



||A||i<;, 
< 2i^i?„(spanCH,6)) 



21n(2/(5) 



(A.3) 



Next, by $\mathfrak{R}_m(\mathrm{span}(\mathcal{H},b)) = b\,\mathfrak{R}_m(\mathrm{span}(\mathcal{H},1)) = b\,\mathfrak{R}_m(\mathcal{H})$ and an appeal to Massart's Finite Lemma
(Boucheron et al., 2005, Theorem 3.3),



i?m(span('H, b)) < 



2\n{r 



Plugging this into eq. (A.3) and recalling the choice $c = \max\{2L_b b\sqrt{2}, \phi(b)\}$, the result follows. □

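Lemma A.4. Let $S \subseteq \mathbb{R}$ and convex $f : S \to \mathbb{R}$ be given. If $x, y \in S$ are given with $x < y$ and
$f(x) < f(y)$, then for every $S \ni z > y$, $f(y) < f(z)$.

Proof. Write $y$ as a combination of $x$ and $z$:

$$y = \frac{z-y}{z-x}\,x + \frac{y-x}{z-x}\,z.$$

By convexity and $f(y) > f(x)$,

$$\frac{z-y}{z-x}f(y) + \frac{y-x}{z-x}f(z) > \frac{z-y}{z-x}f(x) + \frac{y-x}{z-x}f(z) \ge f\left(\frac{z-y}{z-x}\,x + \frac{y-x}{z-x}\,z\right) = f(y).$$

Rearranging and using $x < y$, it follows that $f(y) < f(z)$. □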
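Lemma A.4 says that once a convex function starts to increase, it keeps strictly increasing. A brute-force check over a finite set of points, sketched below in Python, makes this easy to experiment with; the helper name and the test functions are assumptions for illustration, and a finite check of course proves nothing in general.

```python
from itertools import combinations

def find_violation(f, points):
    """Return the first triple x < y < z with f(x) < f(y) but not f(y) < f(z),
    or None when no such triple exists among the given points."""
    for x, y, z in combinations(sorted(points), 3):
        if f(x) < f(y) and not f(y) < f(z):
            return (x, y, z)
    return None

def shifted_square(t):
    """A convex function, evaluated exactly on integers."""
    return (t - 1) ** 2

def negated_square(t):
    """A concave function, for which the conclusion of the lemma fails."""
    return -(t * t)
```

On the integers from $-5$ to $5$, `find_violation(shifted_square, range(-5, 6))` returns `None`, while `find_violation(negated_square, range(-5, 6))` returns a violating triple.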



B Convexity properties of $\mathcal{R}_\phi$

Lemma B.1. Let finite measure $\nu$ and $\phi \in \Phi$ be given. Then the function

$$L^\infty(\nu) \ni q \mapsto \int \phi(q) \in \mathbb{R}$$

is well-defined, convex, and lower semi-continuous. Next, $(L^\infty(\nu))^*$ can be written as the direct sum
of two spaces, one being $L^1(\nu)$; for any $p \in (L^\infty(\nu))^*$, let $p_1 + p_2$ be the corresponding decomposition
(with $p_1 \in L^1(\nu)$). With this notation, $\int qp_2 = 0$ for any $q \in L^\infty(\nu)$; furthermore, the Fenchel
conjugate to the above map is

$$(L^\infty(\nu))^* \ni p \mapsto \int \phi^*(p_1),$$

which is again well-defined, convex, and lower semi-continuous. Lastly, the subdifferential set to the
first map may be obtained by simply passing the subdifferential operator through the integral:

$$\partial\left(\int \phi\right)(q) = \left\{ p \in (L^\infty(\nu))^* : p_1 \in \partial\phi(q)\ \nu\text{-a.e.} \right\}.$$
Proof. The proof will proceed with heavy reliance upon results due to Rockafellar (1971). To
start, note that $\phi$, being convex and continuous (cf. Lemma A.1), is a normal convex integrand
(Rockafellar, 1971, Lemma 1).

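Let $Z : \mathcal{X} \to \mathbb{R}$ denote the zero map, i.e. $Z(x) = 0$ everywhere. Note that $\phi \circ Z \in L^1(\nu)$, and
similarly $\phi^* \circ Z \in L^1(\nu)$ (since $\phi^*(0) = 0$; cf. Lemma A.1): these facts provide the conjugacy formula

$$\left(\int \phi\right)^*(p) = \int \phi^*(p_1) + \sup\left\{ p_2(q) : q \in L^\infty(\nu), \int \phi(q) < \infty \right\}, \tag{B.2}$$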



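where the decomposition $p = p_1 + p_2$ is as in the lemma statement (Rockafellar, 1971, Theorem 1).
Next, notice that $\mathrm{dom}(\int \phi) = L^\infty(\nu)$; in particular, given any $q \in L^\infty(\nu)$,

$$\int \phi(q) \le \int \phi(\|q\|_\infty) = \phi(\|q\|_\infty)\,\nu(\mathcal{X} \times \mathcal{Y}) < \infty.$$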



As such, consider an arbitrary $p_2$ and $q \in L^\infty(\nu)$. Since $p$ is a continuous linear functional on
$L^\infty(\nu)$, then so is $p_2$ (otherwise the formula $p = p_1 + p_2$ would not make sense). Next, as stated by
Rockafellar (1971, introduction to Section 2), it is possible to choose sets $S_k$ with $\nu(S_k) < 1/k$, and
$p_2(q) = 0$ over every $S_k^c$ and $q \in L^\infty(\nu)$. Now define $U_k := \cup_{i\le k} S_i^c$. By continuity of measures from
below (Folland, 1999, Theorem 1.8c), $\nu(U_k) \uparrow \nu(\mathcal{X} \times \mathcal{Y})$. As such, by the dominated convergence
theorem (Folland, 1999, Theorem 2.25), and setting $U_0 = \emptyset$,

$$\int p_2 q = \sum_{k=1}^\infty \int_{U_k \setminus U_{k-1}} p_2 q = 0.$$



That is to say, the supremum term in eq. (B.2) is simply zero; plugging this back into eq. (B.2), the
desired conjugacy relation follows. Note that the same result, due to Rockafellar (1971, Theorem
1), provides that the integrals are well-defined, and moreover that the pair of conjugate functions are
both convex and lower semi-continuous (as a consequence of being mutually conjugate). Lastly, the
above derivation has established that $\int \phi$ is finite over $L^\infty(\nu)$, but it is possible that $\int \phi^*$ is infinite,
even over $L^1(\nu)$ (i.e., and not just over $(L^\infty(\nu))^*$).

For the subdifferential relation, a related result by Rockafellar (1971, Corollary 1A) provides
that $(L^\infty(\nu))^* \ni p \in \partial(\int \phi)(q)$ (for some $q \in L^\infty(\nu)$) precisely when $p_1 \in \partial\phi(q)$ $\nu$-a.e., and the
supremum in eq. (B.2) is attained for $p_2$ at $q$. It was already established that the supremum is
always zero, as is $p_2(q)$, and the result follows. □

Corollary B.3. Let a finite measure $\nu$ and $\phi \in \Phi$ be given. The function

$$\mathbb{R}^n \ni \lambda \mapsto \int \phi(-y(H\lambda)(x))\,d\nu(x,y) \in \mathbb{R}$$

is convex and continuous.

Proof. Note that

$$\lambda \mapsto -y(H\lambda)x$$

is a bounded linear operator (and thus continuous), and the latter object, taken as a function over
$\mathcal{X} \times \mathcal{Y}$, is within $L^\infty(\nu)$. Combined with the lower semi-continuity and convexity of $\int\phi$ as per
Lemma B.1, it follows that the map in question is convex and lower semi-continuous. Since it is
finite everywhere, it is in fact continuous (Rockafellar, 1970, Corollary 7.2.2). □
Lemma B.4. Let a linear classification problem $(\mathcal{H}, \nu)$ and any $\phi \in \Phi$ be given. Then

$$\inf\Big\{ \int \phi(-y(H\lambda)x)\,d\nu(x,y) : \lambda \in \mathbb{R}^n \Big\} = \max\Big\{ -\int \phi^*(p) : \max\{p, 0\} \in \mathcal{D}(\mathcal{H}, \nu) \Big\},$$

where the max is taken element-wise. Furthermore, if a primal optimum $\bar\lambda$ exists, then there is a
$\bar p \in \mathcal{D}(\mathcal{H}, \nu)$ with $\bar p(x,y) \in \partial\phi(-y(H\bar\lambda)x)$ $\nu$-a.e.

Proof. For convenience, define the linear operator

$$(A\lambda)(x,y) := -y(H\lambda)x.$$

Note that $A$ is a bounded linear operator, and furthermore has transpose

$$A^\top p := \sum_{i=1}^n \mathbf{e}_i \int -y\,h_i(x)\,p(x,y)\,d\nu(x,y)$$

(this follows by checking $\langle A\lambda, p\rangle = \langle \lambda, A^\top p\rangle$ for arbitrary $\lambda \in \mathbb{R}^n$ and $p \in (L^\infty(\nu))^*$, which entails
that the formula above provides the unique transpose (Rudin, 1973, Theorem 4.10)).
Consider the following two Fenchel problems:

$$p := \inf\Big\{ \int\phi(A\lambda) + \langle 0, \lambda\rangle : \lambda \in \mathbb{R}^n \Big\}, \qquad d := \sup\Big\{ -\int\phi^*(p_1) - \iota_{\{0\}}(A^\top p) : p \in (L^\infty(\nu))^* \Big\},$$

where $\iota_{\{0\}}$ is the indicator for the set $\{0\}$,

$$\iota_{\{0\}}(\lambda) = \begin{cases} 0 & \text{when } \lambda = 0, \\ \infty & \text{otherwise,} \end{cases}$$

and is the conjugate to $\langle 0, \cdot\rangle$; additionally, $p_1$ is as discussed in the statement of Lemma B.1. To
show $p = d$ and thus prove the desired result, an appropriate Fenchel duality rule will be applied
(Zalinescu, 2002, Corollary 2.8.5 using condition (vii)).

To start, note that $\int\phi$ and $\int\phi^*$ are conjugates, as provided by Lemma B.1. Next, also from
Lemma B.1, $\int\phi$ is finite everywhere over $L^\infty(\nu)$. As a result,

$$A\operatorname{dom}(\langle 0, \cdot\rangle) - \operatorname{dom}\Big(\int\phi\Big) = A\operatorname{dom}(\langle 0, \cdot\rangle) - L^\infty(\nu) = L^\infty(\nu).$$

The significance of this fact is that it will act as the constraint qualification granting $p = d$.

Lastly, $\mathbb{R}^n$ and $L^\infty(\nu)$ are Banach and thus Fréchet spaces. As such, all conditions necessary for
Fenchel duality are met (Zalinescu, 2002, Corollary 2.8.5 using condition (vii)), and it follows that
$p = d$ as desired, with attainment in the dual.

The next goal is to massage this duality expression into the one appearing in the lemma statement.
To start, as provided by Lemma B.1, $\int q\,p_2 = 0$ for any $q \in L^\infty(\nu)$, and in particular $A^\top p_2 = 0$;
consequently, $p_2$ has no effect on either term in the dual objective, and the domain of the dual may
be restricted to $L^1(\nu)$.

Next, Lemma A.1 grants $\operatorname{dom}(\phi^*) \subseteq \mathbb{R}_+$, and so the domain of the dual problem may be safely
restricted to $p \ge 0$ $\nu$-a.e. (since $0$ is always dual feasible, and $\nu([p < 0]) > 0$ entails an objective
value of $-\infty$). By the form of $A^\top$, $\iota_{\{0\}}(A^\top p)$ is finite iff

$$\int y\,h(x)\,p(x,y)\,d\nu(x,y) = 0$$

for all $h$; it follows that $\iota_{\{0\}}(A^\top p)$ is finite iff

$$\int (A\lambda)(x,y)\,p(x,y)\,d\nu(x,y) = 0$$

for all $\lambda \in \mathbb{R}^n$. Combining these facts, an equivalent form for the dual problem is

$$\max\Big\{ -\int\phi^*(p) : \max\{p, 0\} \in \mathcal{D}(\mathcal{H}, \nu) \Big\},$$

just as in the statement of the lemma.

Lastly, the Fenchel duality rule invoked above, as presented by Zalinescu (2002), also provides
that a primal optimum $\bar\lambda$ exists iff there is a $p' \in (L^\infty(\nu))^*$ with $-A^\top p' \in \partial(\langle 0, \cdot\rangle)(\bar\lambda)$ and
$p' \in \partial(\int\phi)(A\bar\lambda)$. The first part simply states that $\max\{p'_1, 0\} \in \mathcal{D}(\mathcal{H}, \nu)$ as above. The second part,
when combined with the subdifferential rule of Lemma B.1, gives $p'_1 \in \partial\phi(A\bar\lambda)$ $\nu$-a.e. To obtain the
desired statement, set $\bar p := \max\{p'_1, 0\}$, which satisfies all desired properties. □



C Structure of $\mathcal{R}_\phi$ over $\mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$

The following theorem leads to a number of properties presented in Sections iandi; it is easiest to 
prove them at once, as a ring of implications. 

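Theorem C.1. Let a linear classification problem $(\mathcal{H}, \mu)$ and a set $D$ be given. The following
statements are equivalent.

1. For every $\lambda \in \mathbb{R}^n$, either $\mu(D \cap [y(H\lambda)x = 0]) = \mu(D)$ or $\mu(D \cap [y(H\lambda)x < 0]) > 0$.

2. Given any $\rho$, there exists a bound $b$ and a null set $N \subseteq \mathcal{X} \times \mathcal{Y}$ (i.e., $\mu(N) = 0$) so that for
every $\rho$-suboptimal weighting $\lambda$ over $D$, meaning any weighting satisfying

$$\mathcal{R}_{\phi;D}(H\lambda) \le \bar{\mathcal{R}}_{\phi;D}(\operatorname{span}(\mathcal{H})) + \rho,$$

there exists $\lambda'$ with $\|\lambda'\|_1 \le b$ and $H\lambda = H\lambda'$ over $D \setminus N$.

3. $D \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$.

The following structural lemma is crucial.

Lemma C.2. Let $(\mathcal{H}, \mu)$ and a set $D$ be given. Define the set

$$\mathcal{K} := \{\lambda \in \mathbb{R}^n : y(H\lambda)x = 0 \text{ for } \mu\text{-a.e. } (x,y) \in D\}.$$

The following statements hold.

1. $\mathcal{K}$ is a subspace.

2. There exists a set $N$ with $\mu(N) = 0$ so that, for any $\lambda \in \mathbb{R}^n$, the orthogonal projection
$\lambda \mapsto \lambda^\perp \in \mathcal{K}^\perp$ satisfies $H\lambda = H\lambda^\perp$ everywhere over $D \setminus N$.

3. There exists a constant $c > 0$ so that, for any $\lambda \in \mathbb{R}^n$ with $\mu(D \cap [H\lambda \ne 0]) > 0$, $\|H\lambda\|_{L^\infty(\mu_D)} / \|\lambda^\perp\|_1 \ge c$,
where $L^\infty(\mu_D)$ is the $L^\infty$ metric with respect to the measure defined by $\mu_D(S) = \mu(D \cap S)$
for any measurable set $S$.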
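Proof. (Item 1) Direct from its construction, $\mathcal{K}$ is a subspace. Crucially, this means that $\mathcal{K}^\perp$ is also
a subspace, and the orthogonal projection $\lambda \mapsto \lambda^\perp$ exists.

(Item 2) Given the subspace pair $\mathcal{K}$ and $\mathcal{K}^\perp$, for any $\lambda \in \mathbb{R}^n$, there exists the decomposition
$\lambda = \lambda^{\parallel} + \lambda^\perp$, where $\lambda^{\parallel} \in \mathcal{K}$ and $\lambda^\perp \in \mathcal{K}^\perp$. By definition, $H\lambda^{\parallel} = 0$ $\mu$-a.e. over $D$, and thus
$H\lambda = H\lambda^\perp$ $\mu$-a.e. over $D$.

Now let $Q$ be any countable dense subset of $\mathbb{R}^n$. For each $\lambda_i \in Q$, define $N_i := [H\lambda_i \ne H\lambda_i^\perp]$,
where the above provides $\mu(N_i) = 0$. Set $N := \cup_i N_i$, which is measurable since it is a countable
union, and moreover $\mu(N) = 0$ by $\sigma$-additivity. It will now be argued that the projections onto $\mathcal{K}^\perp$
give equivalences over $D \setminus N$.

To this end, let any $\lambda \in \mathbb{R}^n$, any $(x,y) \in D \setminus N$, and any $\tau > 0$ be given. Since $Q$ is a countable
dense subset of $\mathbb{R}^n$, there exists $\lambda_i \in Q$ with $\|\lambda_i - \lambda\|_1 < \tau/2$. Now let $P^\perp$ denote the orthogonal
projection operator onto $\mathcal{K}^\perp$; then

$$\begin{aligned}
0 &\le |(H\lambda)(x) - (H\lambda^\perp)(x)| = |(H\lambda)(x) - (HP^\perp\lambda)(x)| \\
&= |(H(\lambda - \lambda_i + \lambda_i))(x) - (HP^\perp(\lambda - \lambda_i + \lambda_i))(x)| \\
&\le |(H\lambda_i)(x) - (H\lambda_i^\perp)(x)| + |H(\lambda - \lambda_i)(x)| + |HP^\perp(\lambda - \lambda_i)(x)| \\
&\le |0| + \|H\|_\infty \|\lambda - \lambda_i\|_1 + \|H\|_\infty \|P^\perp\|_\infty \|\lambda - \lambda_i\|_1 \\
&\le 0 + \tau/2 + \tau/2 = \tau.
\end{aligned}$$

Taking $\tau \downarrow 0$, it follows that $H\lambda = H\lambda^\perp$ over $D \setminus N$.

(Item 3) For the final part, if every $\lambda \in \mathbb{R}^n$ has $\mu(D \cap [H\lambda \ne 0]) = 0$, there is nothing to show,
so suppose there exists $\lambda \in \mathbb{R}^n$ with $\mu_D([H\lambda \ne 0]) > 0$. Consider the optimization problem

$$\inf\Big\{ \frac{\|H\lambda\|_{L^\infty(\mu_D)}}{\|\lambda^\perp\|_1} : \lambda \in \mathbb{R}^n,\ \mu_D([H\lambda \ne 0]) > 0 \Big\} = \inf\Big\{ \|H\lambda\|_{L^\infty(\mu_D)} : \lambda \in \mathcal{K}^\perp,\ \|\lambda\|_1 = 1 \Big\}.$$

The latter is a minimization of a continuous function over a nonempty compact set, and thus attains
a minimizer $\bar\lambda$. But $\bar\lambda \in \mathcal{K}^\perp$ and $\|\bar\lambda\|_1 = 1$, thus $\|H\bar\lambda\|_{L^\infty(\mu_D)} > 0$. The result follows with
$c := \|H\bar\lambda\|_{L^\infty(\mu_D)} > 0$. □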

Proof of Theorem C.1. (Item 1 $\Rightarrow$ Item 2.) Let $\rho$ be given, and let $N$ be the set, as provided by
Lemma C.2, so that every $\lambda \in \mathbb{R}^n$ has $H\lambda = H\lambda^\perp$ everywhere on $D \setminus N$. Suppose contradictorily that
the remainder of the desired statement is false; one way to say this is that there exists a sequence
$\{\lambda_i\}_{i=1}^\infty$ so that every equivalent representation over $D \setminus N$ (i.e., $H\lambda_i = H\lambda'_i$ over this set) has
$\sup_i \|\lambda'_i\|_1 = \infty$, but $\mathcal{R}_{\phi;D}(H\lambda_i) \le \bar{\mathcal{R}}_{\phi;D}(\operatorname{span}(\mathcal{H})) + \rho$. (It can be taken without loss of generality
that $\lambda_i \ne 0$ for every $i$.)

To build the contradiction, choose representation $\lambda_i^\perp$, which satisfies $H\lambda_i^\perp = H\lambda_i$ over $D \setminus N$ via
Lemma C.2. Note that $\{\lambda_i^\perp / \|\lambda_i^\perp\|_1\}_{i=1}^\infty$ lies in a compact set (the unit ball), and thus let $\lambda_i^{(2)}$ be
a subsequence with $\lambda_i^{(2)} / \|\lambda_i^{(2)}\|_1 \to \bar\lambda \in \mathbb{R}^n$. Since the assumed contradiction was that no
representation is bounded, $\lambda_i^{(2)}$ is unbounded; since there exists a $c > 0$ with $\|H\lambda_i^{(2)}\|_{L^\infty(\mu_D)} / \|\lambda_i^{(2)}\|_1 \ge c$
(cf. Lemma C.2), it follows by continuity of $H$ and norms that $\|H\bar\lambda\|_{L^\infty(\mu_D)} \ge c$, and in particular
$\mu(D \cap [y(H\bar\lambda)x \ne 0]) > 0$.

By assumption (i.e., by Item 1), since $\mu(D \cap [y(H\bar\lambda)x \ne 0]) > 0$, then $\mu(D \cap [y(H\bar\lambda)x < 0]) > 0$;
for convenience, define the set $P := [y(H\bar\lambda)(x) < 0]$. Thus, for any $\lambda \in \mathbb{R}^n$, taking any $g \in \partial\phi(0)$
(note $g > 0$ via Lemma A.1),



^.^ In H-y{H{X + tX)){x)) ^ cl,{-y{HX){x)) 

t-i-oo t 



> lim 



t— >-oo t 

> ^.^ /(</>(0) + g(-y(g(A + tXmmm^, y)eDnP)-f^ 4>{-y{H\){x)) 
-yiHX){x)t{{x,y)eDnP) 

^ j.^ nm + g{-yiHX){x)))l{{x, y)eDnP)~J^ ^{-y{HX)ix)) 

t— »oo t 

>0. (C.3) 

The above statement shows that $\int\phi$ eventually grows in direction $H\bar\lambda$, and in particular must exit
the desired $\rho$-sublevel set

$$C_\rho := \{\lambda \in \mathbb{R}^n : \mathcal{R}_{\phi;D}(H\lambda) \le \bar{\mathcal{R}}_{\phi;D}(\operatorname{span}(\mathcal{H})) + \rho\}.$$

To develop the contradiction, it will be shown that the construction of $\bar\lambda$ indicates it should be in
this sublevel set $C_\rho$; the proof will be similar to one due to Hiriart-Urruty and Lemaréchal (2001,
Proposition A.2.2.3).

Since $\int\phi$ and $\int_D\phi$ are convex and lower semi-continuous (cf. Lemma B.1), sublevel sets, in
particular $C_\rho$, are closed convex sets. By construction of $\bar\lambda$,

$$H\lambda_j + tH\bar\lambda = \lim_{i\to\infty} \Big( \Big(1 - \frac{t}{\|\lambda_i^{(2)}\|_1}\Big) H\lambda_j + \frac{t}{\|\lambda_i^{(2)}\|_1} H\lambda_i^{(2)} \Big) \in C_\rho.$$

This holds for all $t > 0$, but since $H\bar\lambda \ne 0$, eq. (C.3) forces $H\lambda_j + tH\bar\lambda$ to leave any sublevel set (for
sufficiently large $t$), and in particular $C_\rho$, a contradiction.

(Item 2 $\Rightarrow$ Item 3.) Choose $\phi := \exp \in \Phi$ and a minimizing sequence $\{\lambda_i^{(1)}\}_{i=1}^\infty$ for $\mathcal{R}_{\phi;D}$, meaning
$\mathcal{R}_{\phi;D}(H\lambda_i^{(1)}) \to \bar{\mathcal{R}}_{\phi;D}(\operatorname{span}(\mathcal{H}))$. Choose any suboptimality $\rho$, and produce $\lambda_i^{(2)}$ by removing all $\lambda_i^{(1)}$
with $\mathcal{R}_{\phi;D}(H\lambda_i^{(1)}) > \bar{\mathcal{R}}_{\phi;D}(\operatorname{span}(\mathcal{H})) + \rho$ (this procedure must be possible, since otherwise $\{\lambda_i^{(1)}\}_{i=1}^\infty$
is not a minimizing sequence). By the assumed statement, there exists $b > 0$ and a null set $N$ so
that each $\lambda_i^{(2)}$ may be replaced with $\lambda_i^{(3)}$, where $\|\lambda_i^{(3)}\|_1 \le b$, and $H\lambda_i^{(3)} = H\lambda_i^{(2)}$ over $D \setminus N$, which
in particular means $\lambda_i^{(3)}$ is also a minimizing sequence. But this is now a minimizing sequence lying
within a compact set, so, perhaps by passing to a subsequence $\lambda_i^{(4)}$, it has a limit $\bar\lambda \in \mathbb{R}^n$. Since
$\lambda \mapsto \int\phi(-y(H\lambda)x)$ is continuous (cf. Corollary B.3), it follows that $\bar\lambda$ attains the desired infimal
value.

Applying the duality relation in Lemma B.4 to $\mathcal{R}_{\phi;D}$ (i.e., using the measure $\nu = \mu_D$, meaning
$\nu(S) = \mu(D \cap S)$ for any measurable set $S$), the existence of a primal minimum $\bar\lambda$ grants the existence
of a dual maximum $\bar p$ satisfying $\bar p \in \mathcal{D}(\mathcal{H}, \nu)$, and moreover

$$\bar p(x,y) \in \partial\phi(-y(H\bar\lambda)x) = \{\exp(-y(H\bar\lambda)x)\}$$

$\nu$-a.e. As such, the choice $p'(x,y) := \exp(-y(H\bar\lambda)(x))$ satisfies $p' = \bar p$ $\nu$-a.e., and thus $p' \in \mathcal{D}(\mathcal{H}, \nu)$;
moreover $p' > 0$ everywhere, since $\exp > 0$ everywhere.

This reweighting $p'$ was with respect to $\nu$, so to finish, define $p^*(x,y) := p'(x,y)\,\mathbb{1}((x,y) \in D)$.
By construction, $[p^* > 0] = D$. Finally, given any $\lambda \in \mathbb{R}^n$,

$$\begin{aligned}
\int y(H\lambda)(x)\,p^*(x,y)\,d\mu(x,y) &= \int y(H\lambda)(x)\,p'(x,y)\,\mathbb{1}((x,y) \in D)\,d\mu(x,y) \\
&= \int y(H\lambda)(x)\,p'(x,y)\,d\mu_D(x,y) \\
&= 0.
\end{aligned}$$

It follows that $p^* \in \mathcal{D}(\mathcal{H}, \mu)$, and that $D \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$.

(Item 3 $\Rightarrow$ Item 1.) Let $p \in \mathcal{D}(\mathcal{H}, \mu)$ with $D = [p > 0]$ be given, and take any $\lambda \in \mathbb{R}^n$ satisfying
$\mu(D \cap [y(H\lambda)x > 0]) > 0$. But notice then, since $p$ decorrelates $H\lambda$,

$$0 = \int p(x,y)\,y(H\lambda)(x)\,d\mu(x,y) = \int_{D,\,y(H\lambda)(x) > 0} p(x,y)\,y(H\lambda)(x)\,d\mu(x,y) + \int_{D,\,y(H\lambda)(x) < 0} p(x,y)\,y(H\lambda)(x)\,d\mu(x,y).$$

From this it follows that

$$-\int_{D,\,y(H\lambda)(x) < 0} p(x,y)\,y(H\lambda)(x)\,d\mu(x,y) = \int_{D,\,y(H\lambda)(x) > 0} p(x,y)\,y(H\lambda)(x)\,d\mu(x,y) > 0,$$

where the inequality follows from $\mu(D \cap [y(H\lambda)(x) > 0]) > 0$ (Folland, 1999, Proposition 2.23(b)).
The result follows. □
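The implication from Item 3 to Item 1 can be checked numerically on a small discrete problem. The sketch below is only an illustration: the three examples, their labels, and the weighting `p` are hypothetical and do not come from the text. It verifies that `p` decorrelates every sampled weighting, and that any weighting with a positive margin somewhere on the support must have a negative margin elsewhere.

```python
import random

# Hypothetical toy problem: three examples with two features each.
# xs[i] plays the role of (h_1(x_i), h_2(x_i)); ys[i] is the label.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [+1, +1, -1]
p = [1.0, 1.0, 1.0]  # candidate decorrelating weighting, positive on every example


def margins(lam):
    """Return y_i * <lam, x_i> for every example."""
    return [y * sum(l * xj for l, xj in zip(lam, x)) for x, y in zip(xs, ys)]


def correlation(lam):
    """Return sum_i p_i * y_i * <lam, x_i>; it vanishes when p decorrelates lam."""
    return sum(pi * m for pi, m in zip(p, margins(lam)))


def check(trials=1000, seed=0, tol=1e-9):
    """Sample weightings and test both the decorrelation and the Item 1 property."""
    rng = random.Random(seed)
    for _ in range(trials):
        lam = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        m = margins(lam)
        if abs(correlation(lam)) > tol:
            return False
        # A positive margin somewhere forces a negative margin somewhere else.
        if any(v > 0 for v in m) and not any(v < 0 for v in m):
            return False
    return True
```

Here the margins are $\lambda_1$, $\lambda_2$, and $-(\lambda_1 + \lambda_2)$, so their `p`-weighted sum is zero for every weighting, which is exactly the decorrelation condition.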

D Deferred material from Section Q 

In order to invoke standard results for gradient descent, this proof will use material from Section 
to establish the existence of minimizers. Although those results appear later in the text, they do 
not in turn depend on the material here. 



Proof of Proposition \2.d . Suppose $\mathcal{H}$, a sample of size $m$, and suboptimality $\rho > 0$ are given as
specified. Before proceeding, note briefly that the results invoked below (those demonstrating that
$\mathcal{O}(\mathrm{poly}(1/\rho))$ iterations suffice) neglect to provide a mechanism to stop the algorithms, and thus
to provide a proper oracle. But this may be accomplished by measuring the duality gap, for instance by
specializing the duality relation in Lemma B.4 to the empirical measure.

First suppose $\phi$ is Lipschitz continuous, attains its infimum, and subgradient descent is employed.
Notice that $\mathcal{R}^m_\phi \circ H$ is also Lipschitz continuous (since $H$ is a bounded linear operator), so if it can
be shown that the infimum is attained, the standard analysis of subgradient descent may be applied,
which in particular grants an $\mathcal{O}(1/\rho^2)$ convergence rate when a step size of $\mathcal{O}(1/\sqrt{t})$ is employed,
where $t$ indexes the iterations (Nesterov, 2003, Theorem 3.2.2 and subsequent discussion on step
sizes). To finish, it must be shown that the infimum is attained.
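For concreteness, the following is a minimal sketch of subgradient descent with step size proportional to $1/\sqrt{t}$ on an empirical risk. It is not the procedure analyzed by Nesterov, only an illustration under stated assumptions: the loss is the increasing hinge-type function $\phi(z) = \max\{0, 1 + z\}$ (Lipschitz, attaining its infimum), and the data set, function names, and parameters are hypothetical. Since subgradient descent is not a descent method, the sketch tracks the best iterate seen so far.

```python
import math


def hinge(z):
    """Increasing hinge-type loss phi(z) = max(0, 1 + z), evaluated at z = -y * <lam, x>."""
    return max(0.0, 1.0 + z)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def empirical_risk(lam, xs, ys):
    """Average of phi(-y_i * <lam, x_i>) over the sample."""
    return sum(hinge(-y * dot(lam, x)) for x, y in zip(xs, ys)) / len(xs)


def subgradient(lam, xs, ys):
    """One subgradient of the empirical risk at lam."""
    g = [0.0] * len(lam)
    for x, y in zip(xs, ys):
        if 1.0 - y * dot(lam, x) > 0.0:  # phi has slope 1 here; chain rule gives -y * x
            for j, xj in enumerate(x):
                g[j] += -y * xj / len(xs)
    return g


def subgradient_descent(xs, ys, iters=200, step0=1.0):
    """Run subgradient descent with step size step0 / sqrt(t); return the best iterate."""
    lam = [0.0] * len(xs[0])
    best_lam, best_risk = list(lam), empirical_risk(lam, xs, ys)
    for t in range(1, iters + 1):
        g = subgradient(lam, xs, ys)
        eta = step0 / math.sqrt(t)
        lam = [l - eta * gj for l, gj in zip(lam, g)]
        risk = empirical_risk(lam, xs, ys)
        if risk < best_risk:
            best_lam, best_risk = list(lam), risk
    return best_lam, best_risk


# Hypothetical separable sample with two features.
XS = [(1.0, 0.5), (0.5, 1.0), (-1.0, -0.5), (-0.5, -1.0)]
YS = [1, 1, -1, -1]
```

Calling `subgradient_descent(XS, YS)` returns a weighting together with its empirical risk; on this sample the infimum of the risk is attained, which is the situation the paragraph above requires.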

To this end, let /im b e th e empirical measure of the training sample, and let ^ be a corresponding 



hard core. By Theorem l5.ll . since fj,m is now a discrete measure, a single weighting Aq G R" can be 



extracted out with y{HXo){x) > over and y{HXo){x) = over Also by Theorem l5.ll every 
1-suboptimal predictor to 7?,™ has a representation which lies in a compact set; thus, minimizing 
sequence lies in the compact set, and a minimizer Ao exists. To finish, since liniz^-Qc ^(2) = and 
(j) attains its infimum, necessarily there is a 5 with 4>{z) = for z < b. As such, it follows that 

z+\\HX\\c^ 



A' := A + An 



min{\y,{HXo){xi)\ : {x,,y^) G 
is an optimum to the full problem. First, it is zero over '^'^j since for any (x, y) € 



y{HX'){x) = y{H~X){x) + y{HXo){x) 



\HX\\ 



mm{\y,{HXo){x,)\ : {x,,y,) G ^^=1 
>~\\HX\\^ + (z+\\HX\\^), 

and the choice of z (i.e., 4){—y{HX'){x)) = 0). Next, A' is equivalent to to A over ^. Finally, if there 
exists some A* which achieves a lower objective value than A', necessarily it would be better than A 
over contradicting optimality of A. In particular, the infimum is attained, and the proof for this 
choice of </) is complete. 

Now suppose that $\phi$ is in the convex cone generated by the logistic and exponential losses; if it
can be shown that $\phi$ is within $\mathbb{G}$, a class of losses known to possess $\mathcal{O}(1/\rho)$ convergence rates for
boosting (Telgarsky, 2012, Definition 19, Theorem 21, Theorem 23, Theorem 27), then the result
follows.

To this end, first notice that $\mathbb{G}$ is a cone: given any $c > 0$ and $g \in \mathbb{G}$ with certifying constants
$\eta, \beta$, then $cg \in \mathbb{G}$ with the exact same constants. Since the exponential and logistic losses are within
$\mathbb{G}$ (Telgarsky, 2012, Remark 46), then so are all rescalings.

To finish, let $\phi_1$ and $\phi_2$ respectively denote the logistic and exponential losses, and let any
$c_1, c_2 > 0$ be given; if it can be shown that $c_1\phi_1 + c_2\phi_2 \in \mathbb{G}$, then combined with the earlier cases,
the proof is complete. First note that

$$\sum_{i=1}^m \big(c_1\phi_1(x_i) + c_2\phi_2(x_i)\big) \le m\big(c_1\phi_1(0) + c_2\phi_2(0)\big)$$

implies

$$\forall i, \quad x_i \le \ln\Big(\frac{m(c_1\phi_1(0) + c_2\phi_2(0))}{c_2}\Big);$$

henceforth define $c := m(c_1\phi_1(0) + c_2\phi_2(0))/c_2$, and as per the definition of $\mathbb{G}$, the constants $\eta$ and
$\beta$ must be established under the assumption $x \le \ln(c)$.

For any $x \in (-\infty, \ln(c)]$, since $\ln$ is concave, there is a secant lower bound

$$\ln(1 + e^x) \ge \frac{\ln(1 + c) - \ln(1 + 0)}{c - 0}\, e^x = \frac{\ln(1 + c)}{c}\, e^x;$$

as usual, there is also the upper bound $\ln(1 + e^x) \le e^x$.
As such, for any $x \in (-\infty, \ln(c)]$, since $\phi_1'(x) = e^x/(1 + e^x)$,

$$\frac{c_1\phi_1(x) + c_2\phi_2(x)}{c_1\phi_1'(x) + c_2\phi_2'(x)} = \frac{c_1\ln(1 + e^x) + c_2 e^x}{c_1 e^x/(1 + e^x) + c_2 e^x} \le \frac{e^x(c_1 + c_2)}{e^x(c_1/(1 + c) + c_2)},$$

and so it suffices to set $\beta := (c_1 + c_2)/(c_1/(1 + c) + c_2)$. Furthermore, since $\phi_1''(x) = e^x/(1 + e^x)^2$,

$$\frac{c_1\phi_1''(x) + c_2\phi_2''(x)}{c_1\phi_1(x) + c_2\phi_2(x)} = \frac{c_1 e^x/(1 + e^x)^2 + c_2 e^x}{c_1\ln(1 + e^x) + c_2 e^x} \le \frac{e^x(c_1 + c_2)}{e^x(c_1\ln(1 + c)/c + c_2)},$$

thus $\eta := (c_1 + c_2)/(c_1\ln(1 + c)/c + c_2)$ suffices. □
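The two ratio bounds can be sanity-checked numerically. The sketch below is an illustration only, assuming that $\phi_1(x) = \ln(1 + e^x)$ and $\phi_2(x) = e^x$ (so that $\phi_1(0) = \ln 2$ and $\phi_2(0) = 1$) and reading the constants $c$, $\beta$, and $\eta$ as in the derivation above; the function names and the sampling grid are hypothetical.

```python
import math


def constants(c1, c2, m):
    """Return (c, beta, eta), with phi_1 the logistic loss and phi_2 the exponential loss."""
    c = m * (c1 * math.log(2.0) + c2 * 1.0) / c2  # phi_1(0) = ln 2, phi_2(0) = 1
    beta = (c1 + c2) / (c1 / (1.0 + c) + c2)
    eta = (c1 + c2) / (c1 * math.log(1.0 + c) / c + c2)
    return c, beta, eta


def ratios(x, c1, c2):
    """Return (loss / first derivative, second derivative / loss) of c1*phi_1 + c2*phi_2 at x."""
    ex = math.exp(x)
    loss = c1 * math.log1p(ex) + c2 * ex
    d1 = c1 * ex / (1.0 + ex) + c2 * ex
    d2 = c1 * ex / (1.0 + ex) ** 2 + c2 * ex
    return loss / d1, d2 / loss


def bounds_hold(c1, c2, m, grid=200):
    """Check both ratio bounds on a grid of points x in [ln(c) - 30, ln(c)]."""
    c, beta, eta = constants(c1, c2, m)
    top = math.log(c)
    for k in range(grid + 1):
        x = top - 30.0 * k / grid
        r_beta, r_eta = ratios(x, c1, c2)
        if r_beta > beta + 1e-9 or r_eta > eta + 1e-9:
            return False
    return True
```

A grid check of this kind cannot replace the proof, but it catches sign or index slips in the constants quickly.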



E Deferred material from Section S 



Proof of Proposition lS.M . As stated in the proposition, set $\mathcal{X} = [-1,+1]^2$ and $\mathcal{H}$ to be the two
projection maps $h_1(x) = x_1$ and $h_2(x) = x_2$. Next define a set of positive instances $\{p_i\}_{i=1}^\infty$, and
their corresponding probability mass:


Pi 



1^1-0.5-4^ 
Here are the negative instances: 

= [ 1-0.3-4^-'] ' A^K) = 2-'-'^. 

Notice that $\mu$ has countable support, and $\mu(\mathcal{X}) = 1$. Furthermore, the vector $\bar\lambda = (-1,+1)$ is
a perfect separator: given any positive example $p_i$, $(H\bar\lambda)(p_i) > 0$, and given negative example $n_i$,
$(H\bar\lambda)(n_i) < 0$. Note however that, as required by the proposition statement, the margins go to zero.
Still, given any $\phi \in \Phi$, since $\lim_{z\to-\infty}\phi(z) = 0$,

$$0 \le \inf_\lambda \mathcal{R}_\phi(H\lambda) \le \lim_{c\uparrow\infty} \int \phi(-y_i(Hc\bar\lambda)(z_i))\,d\mu(z_i, y_i) = 0.$$

The key property of this construction is that the positive and negative examples are staggered;
this will cause max margin solutions to avoid $\bar\lambda$. As such, let any finite sample of size $m$ be given.
If all drawn examples have the same class $y$, then $\lambda = (1 - y, 1 + y)$ (which is a maximum margin
solution) has either $n_i$ or $p_i$ on the wrong side of the separator, and by choosing $c > 0$ large enough,
$\mathcal{R}_\phi(cH\lambda) > b$.

As such, henceforth suppose there is at least one positive example, and at least one negative
example. Suppose $j$ and $k$ respectively denote a sampled positive point $p_j$ and sampled negative
point $n_k$ having highest index among positive and negative examples; these maxima exist since $m$
is finite.

Every max margin solution is determined solely by $p_j$ and $n_k$. To obtain one of them, define




-(l + (nfc)2)/(2+(pj)l+(nfc)2) 
(l + fe)i)/(2+(Pj)l+(nfc)2) 



To verify that this is a max margin solution, note that for any sampled (positive or negative) point 
Zi with label yt E { — 1, +1}, 

y^iH\)z, > iH\)ip,) = -(i7A)(n,) = - {X,n,) = ^M^^)^ > q. 

By construction, however, {pj)i ^ ink)2, meaning A is not a rescaling of A. As such, A is wrong for 
either all large pi or Ui, and taking A = gA with q large, it follows that 7ltf,{H\) > b. □ 



F Deferred material from Section 3 

Throughout this section, the following notation for measures will be employed.

Definition F.1. Given a measure $\mu$ and a set $P$, let $\mu_P$ be the restriction of $\mu$ to $P$: for any
measurable set $S$, $\mu_P(S) = \mu(P \cap S)$. Note also that $d\mu_P(x,y) = \mathbb{1}((x,y) \in P)\,d\mu(x,y)$. ◇



F.1 Proof of Theorem 4.2



In order to establish the existence of hard cores, this section first establishes a few properties of
$\mathcal{D}(\mathcal{H}, \mu)$ and $\mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$.

Lemma F.2. Given any $\{c_i\}_{i=1}^\infty$ with $c_i \ge 0$ and $\{p_i\}_{i=1}^\infty$ with $p_i \in \mathcal{D}(\mathcal{H}, \mu)$ and $\sum_i c_i\|p_i\|_1 < \infty$,
the limit object $p_\infty := \sum_i c_i p_i$ exists, and satisfies $p_\infty \in \mathcal{D}(\mathcal{H}, \mu)$.

Proof. Let $\{c_i\}_{i=1}^\infty$ and $\{p_i\}_{i=1}^\infty$ be given as specified. First, by the monotone convergence theorem,
the function $p_\infty := \sum_i c_i p_i$ exists (i.e., all limits converge pointwise), is measurable, and satisfies
$\int p_\infty = \sum_i \int c_i p_i < \infty$, meaning $p_\infty \in L^1(\mu)$ (Folland, 1999, Theorem 2.15). Now let any $\lambda \in \mathbb{R}^n$
be given; note that $\sum_i \int |c_i p_i (H\lambda)| \le \|H\lambda\|_\infty \sum_i c_i\|p_i\|_1 < \infty$. Thanks to this, by the dominated
convergence theorem (Folland, 1999, Theorem 2.25),

$$\begin{aligned}
\int p_\infty(x,y)\,y(H\lambda)x\,d\mu(x,y) &= \int \sum_{i=1}^\infty c_i p_i(x,y)\,y(H\lambda)x\,d\mu(x,y) \\
&= \sum_{i=1}^\infty c_i \int p_i(x,y)\,y(H\lambda)x\,d\mu(x,y) \\
&= 0. \qquad □
\end{aligned}$$



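Lemma F.3. $\mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$ is closed under countable unions.

Proof. Let any collection $\{C_i\}_{i=1}^\infty$ with $C_i \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$ and corresponding weighting $p_i \in \mathcal{D}(\mathcal{H}, \mu)$ be
given. Define

$$C := \bigcup_i C_i \qquad\text{and}\qquad p := \sum_i \frac{p_i}{2^i \max\{1, \|p_i\|_1\}}.$$

By Lemma F.2, $p$ exists and satisfies $p \in \mathcal{D}(\mathcal{H}, \mu)$. Note further that $C = [p > 0]$, and thus
$C \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$. □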



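Proof of Theorem 4.2. Consider the optimization problem

$$d := \sup\{\mu(C) : C \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)\}.$$

Since $\mathcal{S}_{\mathcal{D}}$ is nonempty (it always contains $\emptyset$, corresponding to $p = 0 \in \mathcal{D}(\mathcal{H}, \mu)$) and $\mu(\mathcal{X} \times \mathcal{Y}) < \infty$,
the supremum is finite. Let $\{C_i\}_{i=1}^\infty$ be a maximizing sequence, and define $D_j := \cup_{i \le j} C_i$ and
$D := \cup_{j=1}^\infty D_j = \cup_{i=1}^\infty C_i$. By Lemma F.3, $D_j \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$ for every $j$, and since $\mu(D_j) \ge \mu(C_j)$, it
follows that $\{D_j\}_{j=1}^\infty$ must also be a maximizing sequence to the above supremum. Finally, since
Lemma F.3 also grants $D \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$, then by continuity of measures from below (Folland, 1999,
Theorem 1.8(c)),

$$\mu(D) = \lim_{j\to\infty} \mu(D_j) = d.$$

Since $D \in \mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$ attains the supremum, it is a dual hard core. □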
F.2 Primal hard cores 

In light of the duality relationship for $\mathcal{R}_\phi$ (cf. Lemma B.4), the definition for hard cores, provided
in Section 3, is tied to the convex dual to $\mathcal{R}_\phi$. Analogously, it is possible to define a primal form of
hard cores, which will lead to a proof of Theorem 4.3.

Definition F.4. Define $\mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$ to contain all sets $C$ for which there exists a sequence $\{\lambda_i\}_{i=1}^\infty$
satisfying the following properties.

1. Every $\lambda_i$ and $(x,y) \in C$ satisfies $y(H\lambda_i)x = 0$.

2. For $\mu$-almost-every $(x,y)$ in $C^c$, $y(H\lambda_i)x \uparrow \infty$.

A primal hard core ^ is a minimal set within S'p{'H, /i): 

^eSv{n,fi) and yC eSv{n,n).fi{:3^\C) = 0Afi{C\^)>Q. O 
Lemma F.5. $\mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$ is closed under countable intersections.






Proof. To start, note that S-p{H,fJ,) is closed under finite intersections as follows. Let {Ci}f^i be 



given with corresponding sequences {A^*''}"^^. Define C := DCi and J2i aI'"*- By construction 



for every {x, y) £ C and pair (i, j), y{H\^^^)x = 0, and thus y{HXj)x — 0. Next, for each d, define 

C'i ^ Ci with ^(C-) = fJ-iC^) so that, for every (x, y) e y{HX^^^)x t oo. Correspondingly, define 
C" := UiCj', where fJ,{C') = l^-iC'^). Now let any e C" and any _B > be given. For each i, 

there are two cases: either this is an area where y{H\^^^)x f oo, or y{HX^j^^)x = 0. In the first case, 
let Ti denote an integer, as granted by y{HX^^^)x f oo, so that for aU j > T^, y{HXf)x > B. For 

those i where (x, y) ^ C- (but still {x,y) G C), due to the ruled out nuUsets, y{HX'"-'^)x = 0, safely 
set Ti = 0. To finish, taking T := max^ T^, it follows that for every j > T, y{HXj)x > B, whereby 
it follows that y{HXj)x f oo over C", and thus over C"^ fi-a..e. 

Now let a countable family $\{D_i\}_{i=1}^\infty$ be given, and define $D := \cap_i D_i$. Consider the optimization
problem

$$p := \inf\Big\{ \int \exp(-y(H\lambda)x)\,d\mu_{D^c}(x,y) : \lambda \in \mathbb{R}^n,\ \forall (x,y) \in D,\ y(H\lambda)x = 0 \Big\}.$$

Define $E_j := \cap_{i \le j} D_i$, whereby $D = \cap_j E_j$. Since $\mu(\mathcal{X} \times \mathcal{Y}) < \infty$, by continuity of measures from
above (Folland, 1999, Theorem 1.8(d)), for any $\tau > 0$ there exists $E_k$ with $\mu(D) \ge \mu(E_k) - \tau$. Since
it was shown above that $\mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$ is closed under finite intersections, $E_k = \cap_{i \le k} D_i \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$;
consequently, let $\{\lambda_i\}_{i=1}^\infty$ be a sequence of predictors certifying that $E_k \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$, as according
to the definition. It follows that

$$p \le \lim_{i\to\infty} \int \exp(-y(H\lambda_i)x)\,d\mu_{D^c}(x,y) = 0 + \int \exp(0)\,d\mu_{E_k \setminus D} = \mu(E_k) - \mu(D) \le \tau.$$

Since $\tau$ was arbitrary, it follows that $p = 0$.

As such, for any positive integer $n$, choose $\lambda_n \in \mathbb{R}^n$ with $y(H\lambda_n)x = 0$ over $D$ satisfying

$$\int \exp(-y(H\lambda_n)x)\,d\mu_{D^c}(x,y) < 1/n^2.$$

By Markov's inequality, it follows that

$$\mu_{D^c}\big([\exp(-y(H\lambda_n)x) \ge 1/n]\big) \le n \int \exp(-y(H\lambda_n)x)\,d\mu_{D^c}(x,y) < 1/n.$$

As such, by definition, $\exp(-y(H\lambda_n)x)$ converges in measure to the function $\mathbb{1}((x,y) \in D)$. Consequently,
there exists a subsequence $\lambda_i^*$ with $\exp(-y(H\lambda_i^*)x) \to \mathbb{1}(D)$ $\mu$-a.e. (Folland, 1999, Theorem
2.30). This is only possible if $y(H\lambda_i^*)x \uparrow \infty$ for $\mu$-a.e. $(x,y) \in D^c$, and the result follows, with $\{\lambda_i^*\}_{i=1}^\infty$
as the certifying sequence for $D$, since every $y(H\lambda_i^*)x = 0$ for $(x,y) \in D$ by construction. □
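The Markov step used above can be illustrated on a finite discrete measure. This is only a sketch under assumptions: the masses, margins, and the integer `n` below are hypothetical, and the measure is represented as a plain list of point masses.

```python
import math


def integral(masses, values):
    """Integral of a nonnegative function against a discrete measure."""
    return sum(w * v for w, v in zip(masses, values))


def tail_mass(masses, values, level):
    """Mass of the set where the function is at least `level`."""
    return sum(w for w, v in zip(masses, values) if v >= level)


def markov_holds(masses, values, level):
    """Markov's inequality: mass([f >= level]) <= integral(f) / level."""
    return tail_mass(masses, values, level) <= integral(masses, values) / level + 1e-12


# Hypothetical point masses and margins y * (H lam)(x) on the complement of D.
MASSES = [0.2, 0.3, 0.5]
MARGINS = [8.0, 9.0, 10.0]
VALUES = [math.exp(-m) for m in MARGINS]
N = 5
```

With these numbers the integral of `VALUES` is below `1 / N**2`, so the mass of the set where the exponential is at least `1 / N` is below `1 / N`, mirroring the two displays in the proof.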

Theorem F.6. Every linear classification problem $(\mathcal{H}, \mu)$ has a primal hard core.

Proof. Consider the optimization problem

$$p := \inf\{\mu(C) : C \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)\}.$$

Since $\mathcal{S}_{\mathcal{P}}$ is nonempty (it always contains $\mathcal{X} \times \mathcal{Y}$ with certifying sequence $\lambda_i = 0$ for every $i$) and $\mu$
is a finite nonnegative measure, the infimum is finite. Let $\{C_i\}_{i=1}^\infty$ be a minimizing sequence, and
define $D_j := \cap_{i \le j} C_i$ and $D := \cap_{j=1}^\infty D_j = \cap_{i=1}^\infty C_i$. By Lemma F.5, $D_j \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$ for every $j$,
and since $\mu(D_j) \le \mu(C_j)$, it follows that $\{D_j\}_{j=1}^\infty$ must also be a minimizing sequence to the above
infimum. Finally, since $\mu$ is finite and Lemma F.5 also grants $D \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$, then by continuity of
measures from above (Folland, 1999, Theorem 1.8(d)),

$$\mu(D) = \lim_{j\to\infty} \mu(D_j) = p.$$

Since $D \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$ attains the infimum, it is a primal hard core. □
With existence of primal hard cores out of the way, the next key is the equivalence to (dual)
hard cores.

Theorem F.7. Let a linear classification problem $(\mathcal{H}, \mu)$ be given, along with a hard core and
a primal hard core. Then the hard core and the primal hard core agree on all but a null set.

The proof needs the following lemma.

Lemma F.8. Let a linear classification problem $(\mathcal{H}, \mu)$, $C_1 \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$, as well as a $\lambda_2 \in \mathbb{R}^n$
be given, with $y(H\lambda_2)x \ge 0$ for $(x,y) \in C_1$ (but potentially $y(H\lambda_2)x < 0$ elsewhere). Then
$C_1 \setminus [y(H\lambda_2)x > 0] \in \mathcal{S}_{\mathcal{P}}(\mathcal{H}, \mu)$.

Proof. Let Ci,A2 be given as specified. Let {A^'^''}^]^ be a certifying sequence for Ci. Define 
P := [y{HX2)x > 0] and C3 Ci \ P = Ci \ [y{HX2)x > 0]. 

Now let i G be arbitrary; the following steps will construct A^-'*-', a certifying sequence for 
C3, meaning C3 E S-piTi.,^). 

(2) 

First, let c be sufficiently large so that A^ := CA2 satisfies 

eM-y{H\f^)x)iip{x,y) < 

By Markov's inequality, it follows that 

/zp([exp(-2/(i/A2)x) > l/i]) < i J exp{-y{H~X2)x)fXp{x,y) < l/i. (F.9) 

Consequently define Pi := [y{H\2)x > ln(i)], where the above statements show /x(Pi) > fJ.{P) — l/i. 

Next, since exp{—y{HX[^^)x) — )■ l(Ci) ^-a.e. and ^j,{Xxy) < 00, by Egoroff's Theorem ()Follandl . 

I999I Theorem 2.33), this convergence is uniform over a subset St with fJ.{Si) > fj,{X,y) — l/i. In 



particular, there exists an integer Ti so that, for any (x, y) G Si D Ci, 

y{HXi^^)x>\\X^^\\,+ln{i). 
As such, define A^^ :— A^.^ + A^^ First, for any {x, y) € C3 and any i, 

y{HXf ^)x = - v{H\f '^)x = y{HX2)x. 
On the other hand, for any (x, y) £ Si D Ci, 

yiHXf^)x = y{HXi^^)x + y{HXf'^)x 

>||Af)||i+ln(z)-||Af'||i = ln(z). 

Lastly, as shown above, for any [x, y) £ Pi, 

y{HXf^)x = + y{HX^^)x > \n{i). 

Combining the above facts, 

exp{-y{HXf'^)x) ~ \\{x, y) € ^3]! > 1/*]) < ^i{C'i \ S,) + /i(P \ P^) < 2/i. 

It follows that exp{-y{HXr')x) t{{x,y) e C3) in measure, and thus there is a subsequence 
{Af^l^i which converges to t{{x,y) G C3) /^-a.e. (|Follandl . [l999l Theorem 2.30). It follows that 
{^i^^}i^i is the desired sequence certifying that C3 G Sp{/H,ii). □ 
Proof of Theorem ] F.Ti . If > 0, then by the maximality of ,^3^ is a set of pos itive measure 

away from any element of Svi'H, /i), an in particular ^ ^ S-pijH, fJ-), and thus Theorem lC.il provides 
the existence of A g M" with fj.{^n[y{HX)x > 0]) = fj.{^) and n{^n[y{HX)x > 0]) > 0. But then, 



by Lemma If. 8l . ^ can be reduced into a smaller element of S-piTi, /i), contradicting its minimality. 
Now suppose ^{'^ \ 3^) > 0, and set v to to be the restriction of fi to 'if: for any C, v{C) := 
n C). Consider the optimization problem 



inf 



exp(-y(ffA)(x))di^(x,y) : A G 



Consider the sublevel set of 1-suboptimal points for this problem. By Theorem lC.il there exists B 
so that each A in this sublevel set has A' with HX = HX' /i-a.e. and ||A'||i < B. However, by the 
definition of there exists a sequence {Xi}^^ which is zero over and approaches cxi fi — a.e. over 
and in particular over the positive measure set 3^. Thus, taking any A in the 1-suboptimal 
set, notice that 

lim / ejip{-y{H{X + X,))x)diy{x,y) = / exp(-y(ff A)(x))l((x, y) ^ <^)d;/(a;, y) =: p. 

i->-coJ J 



Since A has a bounded representation, exp(— A)x) 7^ 0, and thus p < TZ,f,{HX) ( Folland . 19991 
Theorem 2.23(b)). But since the objective function is continuous in A (cf. Lemma Ib.iI ). there 
must exist a large j so that TZ^{H{X + Xj)) < TZ^{HX), and moreover y{H{X + Xj)){x) > B for 
a subset of with positive measure. But that means A + Xj is in the 1-sublevel set, but can not 
have a re pres entation with norm at most B (since iJ is a bounded linear operator), contradicting 
Theorem [cZl. □ 



F.3 Proof of Theorem 4.3 



This is now just a consequence of the equivalence to primal hard cores, and the structure over
$\mathcal{S}_{\mathcal{D}}(\mathcal{H}, \mu)$ developed in Theorem C.1 (which was used to prove the equivalence to primal hard cores as well).

Proof of Theorem 4.3. The second property is direct from Theorem C.1. For the first property, since
primal hard cores exist and are $\mu$-a.e. equivalent to hard cores (cf. Theorem F.7), the statement
follows by taking the sequence provided by the definition of any primal hard core. □



G Deferred material from Section 5 



Proof of Theorem 5.1. (Item 1) Let $\{\lambda_i\}_{i=1}^\infty$ be given as per Theorem 4.3. Automatically, $y(H\lambda_i)x = 0$
for $(x,y) \in \mathcal{C}$. And since $y'(H\lambda_i)x' \uparrow \infty$ for $\mu$-a.e. $(x',y') \in \mathcal{C}^c$, it follows from the definition of
$\Phi$ that $\lim_{i\to\infty} \phi(-y'(H\lambda_i)x') = 0$.

(Item 2) This is a consequence of Theorem C.1. □



Proof of Theorem lsA . (Item[ll) Let a sequence {Ai}^;^ be given as provided by Theorem 4j. In 
particular, exp (—y(HXi)x) — >■ t{'^) ^-a.e. Now choose a finite sample size m; by Egoroff's Theorem 
(|Follandl . [l99a Theorem 2.33), for any r > 0, there exists Sr with ij.{St) > x y) — r/m over 
which this convergence is uniform. As such, choose Ar so that exp(— y(iJAT-)a;) < 1/2 over St n 
meaning in particular y{HXr)x > for every {x,y) G S^- H The probability over a draw of m 
points that some within "rf^ are misclassified by Ar has upper bound bound 

^l"\3{x^,y^) G . y{HX,)x < 0) < m/iC^" n [y{HXi)x < 0]) < r. 

Since r can be made arbitrarily small, the proba bility of failure is zero. Furthermore, since A,- 
satisfies y{HXT){x) — /i-a.e. over (cf. Theorem |4. 31 ). it also follows that, with probability 1, Ar 
abstains on every example falling within 






(Item (3) Let p > and G $ be given. Choose & > 0, as provided by Theorem 5A, so that 
every A € R" with TZ^-'^{HX) < TZ^-'^{span{H)) + 4 + p has a representation A' with ||A'|i < 6, 
where HX = HX' everywhere along '^\N, where n{N) — 0; henceforth, rule out the event that any 
example falls within N. Additionally, choose c > as provided by Proposition |A.2| so that, given 
m<^ i.i.d. points within '^^ \ N, every / S span(H, b) has 



|7^^;^(/)-7^;x/)| <c 



(G.l) 



Now consider any A G M" with no representation ||A'||i < 6 so that HX = HX' over '^\N , which 
directly entails, by Theorem Is.lL that TZ(f,-<g{HX) — 7?.0;<^(span(H)) > p + 4. Additionally choose 
and any A S M" with TZ^-',g{HX) — 7?.0;<if (span(H)) < 1, whereby the choice of 5 > indicates that, 
without loss of generality, ||A||i < h. Since / (j)o H is continuous (cf. Corollarv lB.ST ). considering the 
line segment {aX + (1 — Q!)A : a £ [0, 1]}, there must exist A with 

P + 3 < 7^0;<^(i^A) - 7e0;<^(spanCH)) < p + 4; 

let A' be a representation with ||A'||i < b and HX ~ H X' over 'rf \ N (and thus it holds for every 
example). Applying the deviation inequality in eq. (|G.lr ) twice. 



,/Hn) + ,/W/S) 



TZ^f^{HX') - 7^0;.ir(span(■H)) - (TZ^.^iHX) - 7e0;^(spanCH))) 



2c 



>(p + 3)-(l) 



2c 



where the last step use d the lower bound on me^. Returning to A G M" as specified above, convexity, 
in the form of Lemma O, grants that W^.^iHX) < 7^'^;^(i^A) implies 7^^^(i^A) < 7^J^^(i^A), 
and thus 

nj^^iHX) ~ 7^;;^(span(H)) > 7^^^(i^A) - 7^^^(i^A) > p. 

Since A was arbitrary, it follows that every A with no representation ||A'||i > b that has agreement 
of HX and HX' p-a.e. over does not lie in the empirical p-sublevel set. Since 7?.™<g> is convex 
and continuous, the p-sublevel set is nonempty, and thus every A' within it has a representation 
||A"||i<6. ' ^ □ 

H Deferred material from Section [gI 



Proof of Proposition 6.2. This proof is essentially a repackaging of various results and comments due to Bartlett et al. (2006). Fix any $\phi \in \Phi$; $\phi$ is convex, increasing at 0, and differentiable at 0, which grants that the corresponding $\psi$-transform is classification calibrated (Bartlett et al., 2006, Theorem 6, although note losses in the present manuscript are increasing rather than decreasing). It follows that $\psi(\mathcal{R}_{\mathcal{L}}(f) - \mathcal{R}_{\mathcal{L}}(\mathcal{F})) \le \mathcal{R}_{\phi}(f) - \mathcal{R}_{\phi}(\mathcal{F})$ (Bartlett et al., 2006, Theorem 3, part 3(c)).

Next, $\psi(0) = 0$ (Bartlett et al., 2006, Lemma 5, part 8), $\psi(r) > 0$ when $r > 0$ (Bartlett et al., 2006, Lemma 5, part 9(b)), and since $\psi$ is convex by construction (Bartlett et al., 2006, Definition 2), it follows by Lemma A.4 that $\psi$ is increasing. Since $\psi$ is continuous as well (Bartlett et al., 2006, Lemma 5, part 6), it follows that $\psi$ has a well-defined inverse along the image $\psi([0, 1])$. Finally, the fact that $\psi^{-1}(r) \downarrow 0$ as $r \downarrow 0$ is due to Bartlett et al. (2006, Theorem 3, part 3(b)). □
Proof of Theorem 6.4. Throughout this proof, $\delta' := \delta/8$ will be the failure probability of various crucial events; the final statement is obtained by taking a union bound over these events and then excluding all of them. Note also that some of the statements vacuously hold if $\mu(\mathcal{C}) = 0$ or $\mu(\mathcal{C}) = \mu(\mathcal{X} \times \mathcal{Y})$ (i.e., when terms depending on either appear in denominators); interpret these expressions as simply being $\infty$, whereby the bounds hold automatically.

(Item 1) Let $S_{\mathcal{C}}$ and $S_+$ respectively denote the set of samples landing in $\mathcal{C}$ and $\mathcal{C}^c$, where the notation proposed in the theorem statement provides $m_{\mathcal{C}} = |S_{\mathcal{C}}|$ and $m_+ = |S_+|$. By a Chernoff bound (Kearns and Vazirani, 1994, Theorem 9.2), basic deviations for these quantities are

$$\Pr^m\left[\,|S_{\mathcal{C}}| < (\mu(\mathcal{C}) - \tau)m\,\right] \le \exp(-m\tau^2/2) = \delta',$$
$$\Pr^m\left[\,|S_+| < (\mu(\mathcal{C}^c) - \tau)m\,\right] \le \exp(-m\tau^2/2) = \delta',$$

where $\tau = \sqrt{(2/m)\ln(1/\delta')}$ and $\Pr^m$ denotes the product measure corresponding to $\mu$. Label these failure events $F_1$ and $F_2$, and henceforth rule them out.
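
As a numerical sanity check on this choice of $\tau$, the following minimal Python sketch solves $\exp(-m\tau^2/2) = \delta'$ for $\tau$ and evaluates the resulting count bound $(\text{mass} - \tau)m$. The function names `chernoff_tau` and `count_lower_bound` and all numeric values are hypothetical, chosen only for illustration; the sketch assumes the tail bound has exactly the displayed form.

```python
import math

def chernoff_tau(m: int, delta_prime: float) -> float:
    """Solve exp(-m * tau**2 / 2) = delta_prime for tau."""
    return math.sqrt(2.0 * math.log(1.0 / delta_prime) / m)

def count_lower_bound(m: int, mass: float, delta_prime: float) -> float:
    """(mass - tau) * m: the sample count guaranteed outside the failure event."""
    return (mass - chernoff_tau(m, delta_prime)) * m

# Illustrative values only (not taken from the manuscript).
m = 10_000
delta_prime = 0.05 / 8                    # delta' := delta / 8
tau = chernoff_tau(m, delta_prime)
failure = math.exp(-m * tau ** 2 / 2)     # recovers delta' up to rounding
bound = count_lower_bound(m, 0.5, delta_prime)
```

With these values $\tau$ is small relative to the mass, so the guaranteed count stays above half of the expected count, which is the kind of simplification the later items rely on.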

(Item 0) As provided by Theorem 5.1, there exists $\bar\lambda \in \mathbb{R}^n$ with $y_i(H\bar\lambda)_{x_i} > 0$ for all $(x_i, y_i)$ falling in $\mathcal{C}^c$ and $y_i(H\bar\lambda)_{x_i} = 0$ for those landing in $\mathcal{C}$. Consequently,



7^A(span(-H)) = inf inf 7^rf,.<if (ff (A + cA)) + 7^0.<^c {H{\ + cA)) 

A c>0 

= inf inf TZ^xiHX) 

X 00 

< 7^0(span(H)). 

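Combining this with

$$\mathcal{R}^m_{\phi;\mathcal{C}^c}(H\lambda) + \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda) = \mathcal{R}^m_{\phi}(H\lambda) \le \mathcal{R}^m_{\phi}(\mathrm{span}(\mathcal{H})) + \epsilon,$$

it follows that

$$\mathcal{R}^m_{\phi;\mathcal{C}^c}(H\lambda) \le \mathcal{R}^m_{\phi}(\mathrm{span}(\mathcal{H})) - \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda) + \epsilon = \mathcal{R}^m_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) - \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda) + \epsilon \le \epsilon.$$

Next, since $\phi(0) > 0$ and $\phi$ is nondecreasing (cf. Lemma A.1),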

n-"2^.{HX) 



T^?MHX) < 



m 



To obtain eq. (6.5) from here, first notice that $S_+$, the portion of the sample falling within $\mathcal{C}^c$, can be interpreted as an i.i.d. sample from the probability measure $\mu(\cdot \cap \mathcal{C}^c)/\mu(\mathcal{C}^c)$. Next, the VC dimension of $\mathrm{span}(\mathcal{H})$ is the VC dimension of linear separators over the transformed space

$$\{((h_1(x), h_2(x), \ldots, h_n(x)), y) : (x, y) \in \mathcal{X} \times \mathcal{Y}\};$$

namely, it is $n$. As such, eq. (6.5) follows by an application of a relative deviation version of the VC Theorem (Boucheron et al., 2005, discussion preceding Corollary 5.2).

To obtain eq. (6.6), note that $\epsilon < \phi(0)/m$ means there are no mistakes over $\mathcal{C}^c$:

$$\phi(0) > m\epsilon \ge m\left(\mathcal{R}^m_{\phi}(H\lambda) - \mathcal{R}^m_{\phi}(\mathrm{span}(\mathcal{H}))\right) \ge \sum_{(x_i, y_i) \in S_+} \phi(-y_i(H\lambda)_{x_i}) \ge \max_{(x_i, y_i) \in S_+} \phi(-y_i(H\lambda)_{x_i});$$

that is to say, for every $(x_i, y_i) \in S_+$, $0 < y_i(H\lambda)_{x_i}$. Plugging the resulting zero empirical classification error over $S_+$ into the same relative deviation bound as before (Boucheron et al., 2005, discussion preceding Corollary 5.2), the second bound follows.
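
The step from a small summed loss to zero mistakes can be illustrated with a short Python sketch. It assumes the logistic loss $\phi(z) = \ln(1 + e^z)$ as one concrete nondecreasing loss with $\phi(0) > 0$; the helper name `certifies_no_mistakes` and the margin values are hypothetical.

```python
import math

def logistic_loss(z: float) -> float:
    """phi(z) = ln(1 + e^z): nondecreasing, convex, phi(z) -> 0 as z -> -inf."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def certifies_no_mistakes(margins: list[float]) -> bool:
    """True when the summed loss is below phi(0).

    Each loss is nonnegative, so every individual loss is then below phi(0),
    and monotonicity of phi forces every margin to be strictly positive.
    """
    total = sum(logistic_loss(-margin) for margin in margins)
    return total < logistic_loss(0.0)

clean = [5.0, 6.0, 7.0]   # hypothetical margins y_i (H lambda)_{x_i}
noisy = [-0.1, 5.0]       # contains one misclassified example
```

The certificate is one-directional: a sum below $\phi(0)$ rules out mistakes, whereas a larger sum says nothing either way.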

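(ItemQ) By Theorem 5^, there exist constants $b > 0$ and $c > 0$ depending on $\mathcal{H}$, $\mu$, $\phi$, so that with probability at least $1 - \delta'$, if $m_{\mathcal{C}} \ge c^2(\ln(n) + \ln(1/\delta'))$, then every $\rho$-suboptimal predictor over $\mathcal{C}$, and in particular $\lambda$, has a representation $\lambda'$ which is equivalent to $\lambda$ $\mu$-a.e. over $\mathcal{C}$ and satisfies $\|\lambda'\|_1 \le b$. As such, since

$$\mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda) = \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda') \quad\text{and}\quad \mathcal{R}_{\phi;\mathcal{C}}(H\lambda) = \mathcal{R}_{\phi;\mathcal{C}}(H\lambda'),$$

an application of Proposition A.2 grants

$$\mathcal{R}_{\phi;\mathcal{C}}(H\lambda) \le \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda') + \frac{c\left(\sqrt{\ln(n)} + \sqrt{\ln(2/\delta')}\right)}{\sqrt{m_{\mathcal{C}}}}$$
$$= \mathcal{R}^m_{\phi;\mathcal{C}}(H\lambda) + \frac{c\left(\sqrt{\ln(n)} + \sqrt{\ln(2/\delta')}\right)}{\sqrt{m_{\mathcal{C}}}}$$
$$\le \mathcal{R}^m_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) + \epsilon + \frac{c\left(\sqrt{\ln(n)} + \sqrt{\ln(2/\delta')}\right)}{\sqrt{m_{\mathcal{C}}}}.$$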

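Next, noting that Theorem 5.1 provides that a minimizing sequence to $\mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H}))$ can be taken without loss of generality to lie within a compact set (e.g., points with norm at most $b$), it follows that a minimizer $\bar\lambda$ exists; by an application of McDiarmid's inequality, with probability at least $1 - \delta'$,

$$\mathcal{R}^m_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) \le \mathcal{R}^m_{\phi;\mathcal{C}}(H\bar\lambda) \le \mathcal{R}_{\phi;\mathcal{C}}(H\bar\lambda) + c\sqrt{\frac{2\ln(1/\delta')}{m_{\mathcal{C}}}}.$$

(Note, $\bar\lambda$ is independent of the sample, thus McDiarmid suffices, with constant $c \ge \phi(b)$ since $\bar\lambda$ is in this initial sublevel set.) Combining these two pieces, it follows that

$$\mathcal{R}_{\phi;\mathcal{C}}(H\lambda) - \mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) \le \epsilon + \frac{c}{\sqrt{m_{\mathcal{C}}}}\left(\sqrt{\ln(n)} + 4\sqrt{\ln(2/\delta')}\right),$$

which is precisely eq. (6.7).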

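To produce eq. (6.8), the definition of the $\psi$-transform (cf. Proposition 6.2), combined with Equation
provides

$$\mathcal{R}_{\mathcal{L};\mathcal{C}}(H\lambda) - \mathcal{R}_{\mathcal{L};\mathcal{C}}(\mathcal{F}) \le \psi^{-1}\left(\mathcal{R}_{\phi;\mathcal{C}}(H\lambda) - \mathcal{R}_{\phi;\mathcal{C}}(\mathcal{F})\right)$$
$$= \psi^{-1}\left(\mathcal{R}_{\phi;\mathcal{C}}(H\lambda) - \mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) + \mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) - \mathcal{R}_{\phi;\mathcal{C}}(\mathcal{F})\right)$$
$$\le \psi^{-1}\left(\epsilon + \frac{c\left(\sqrt{\ln(n)} + 4\sqrt{\ln(2/\delta')}\right)}{\sqrt{m_{\mathcal{C}}}} + \mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) - \mathcal{R}_{\phi;\mathcal{C}}(\mathcal{F})\right).$$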



(Item 3) Combining the lower bound on $m$ with Item 1,

> mfi{'^)/2 > c2(ln(n) + ln(l/(5')); 
the first two bounds will allow expressions to be simplified, whereas the last bound will allow an
invocation of item 

As such, combining all preceding bounds (and making use of the refinement over "^"^ when 

< iZ-^M e + ^ — ^ + 7^0;<g.(span(■H)) - U^.^i'S) 

^ 4(nln(2m+ + 1) + ln(4/(5') 



ma 



< V'"' h + ^ , — ^ + 7^^;<^(span(H)) - n^.-^m 

8(nln(m/i(<r'=) + 1) + ln(4/(5') 



□ 



I Deferred material from Section 7

Proof of Theorem 7.1. Let $\mathcal{C}$ be a hard core; set $\rho := 1$, and let $b > 0$ and $c > 0$ be the corresponding reals provided in the guarantee of Theorem 6.4. Note first that $\mathcal{R}_{\phi}(\mathrm{span}(\mathcal{H})) = \mathcal{R}_{\phi}(\mathcal{F})$ implies $\mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) = \mathcal{R}_{\phi;\mathcal{C}}(\mathcal{F})$, since predictions are $\mu$-a.e. perfect off the hard core (cf. Theorem 5.1). Set $\delta_i := 1/i^2$ and choose $m_i$ large enough and $\epsilon_i$ small enough so that the relevant finite sample bound from Theorem 6.4 holds, and goes to zero. (Note that all bounds go to zero as $m_i \uparrow \infty$ and $\epsilon_i \downarrow 0$; the word "relevant" refers to choosing a bound corresponding to the regime $\mu(\mathcal{C}) = 0$, or $\mu(\mathcal{C}^c) = 0$, or $\min\{\mu(\mathcal{C}), \mu(\mathcal{C}^c)\} > 0$.) Note, by the strong assumption, the term $\mathcal{R}_{\phi;\mathcal{C}}(\mathrm{span}(\mathcal{H})) - \mathcal{R}_{\phi;\mathcal{C}}(\mathcal{F})$ may be dropped.

Now let $F_i$ be the failure event of the corresponding finite sample guarantee; by choice of $\delta_i$, $\sum_i \Pr(F_i) \le \sum_i \delta_i = \pi^2/6 < \infty$. Thus, by the Borel-Cantelli Lemma and de Morgan's Laws, $\Pr(\liminf_{i \to \infty} F_i^c) = 1$, meaning $\Pr(\exists j \,.\, \forall i \ge j \,.\, F_i^c) = 1$. This means that the bounds hold for all large $i$ (with probability 1), and the result follows by choice of $m_i$ and $\epsilon_i$. □
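
The summability used in the Borel-Cantelli step can be checked numerically. The sketch below assumes the schedule $\delta_i = 1/i^2$ (the choice consistent with the stated sum $\pi^2/6$); the function names are hypothetical.

```python
import math

def delta_schedule(i: int) -> float:
    """Failure probability allotted to stage i (assumed to be 1 / i^2)."""
    return 1.0 / (i * i)

def partial_sum(n: int) -> float:
    """Sum of the first n failure probabilities."""
    return sum(delta_schedule(i) for i in range(1, n + 1))

def tail_mass(j: int) -> float:
    """Union bound on the probability of any failure at a stage i > j."""
    return math.pi ** 2 / 6 - partial_sum(j)

limit = math.pi ** 2 / 6
```

Since the tail mass after stage $j$ is below $1/j$, the probability that some later bound fails shrinks to zero, which is the quantitative content behind "the bounds hold for all large $i$".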



Proof of Proposition \7.,%. This proof will proceed in the following stages. First, it is shown that the infimal risk $\mathcal{R}_{\phi}(\mathcal{F})$ can be approximated arbitrarily well by bounded measurable functions. Next, Lusin's theorem will allow this consideration to be restricted to a function which is continuous over a compact set. Finally, this function is approximated by a decision tree.

Let $\mu$, $\phi$, $\{\mathcal{H}_i\}_{i=1}^{\infty}$, $\epsilon > 0$ be given as specified. Since the infimum in $\mathcal{R}_{\phi}(\mathcal{F})$ is in general not attained, let $g \in \mathcal{F}$ be a measurable function satisfying

$$\mathcal{R}_{\phi}(g) \le \mathcal{R}_{\phi}(\mathcal{F}) + \epsilon/4.$$

Next let $z > 0$ be a sufficiently large real so that $\phi(-z) \le \epsilon/4$; such a value must exist since $\lim_{z \to -\infty} \phi(z) = 0$. Correspondingly, define a truncation of $g$ as

$$\bar g(x) := \min\{z, \max\{-z, g(x)\}\}.$$

There are three cases to consider. If $|y g(x)| \le z$, then $\phi(-y\bar g(x)) = \phi(-y g(x))$. If $-y g(x) > z$, then by the nondecreasing property (cf. Lemma A.1), $\phi(-y g(x)) \ge \phi(-y\bar g(x))$. Lastly, if $-y g(x) < -z$, then $\phi(-y\bar g(x)) \le \phi(-y g(x)) + \epsilon/4$ by choice of $z$. Together, it follows that

$$\mathcal{R}_{\phi}(\bar g) \le \int \left(\phi(-y g(x)) + \epsilon/4\right) d\mu(x, y) \le \mathcal{R}_{\phi}(\mathcal{F}) + \epsilon/2,$$

which used the fact that $\mu$ is a probability measure. Crucially, $\bar g$ is now a bounded measurable function. Throughout the remainder of this proof, let $\|\cdot\|_u$ denote the uniform norm, meaning

$$\|f\|_u := \sup_x |f(x)|.$$

For example, $\|\bar g\|_u < \infty$.
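
The three-case argument can be checked numerically with a short Python sketch. It assumes the logistic loss $\phi(z) = \ln(1 + e^z)$ as one concrete loss with $\lim_{z \to -\infty} \phi(z) = 0$; the names `truncation_level` and `truncate`, the value of `eps`, and the grid of test values are all hypothetical.

```python
import math

def logistic_loss(z: float) -> float:
    """phi(z) = ln(1 + e^z), evaluated in a numerically stable way."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def truncation_level(eps: float) -> float:
    """Smallest z with logistic_loss(-z) <= eps / 4 (closed form for this loss)."""
    return -math.log(math.expm1(eps / 4.0))

def truncate(value: float, z: float) -> float:
    """Clip a prediction to the interval [-z, z]."""
    return min(z, max(-z, value))

eps = 0.1
z = truncation_level(eps)
# Largest increase in loss caused by truncation, over a grid of predictions
# g in [-20, 20] and both labels y in {-1, +1}.
worst_gap = max(
    logistic_loss(-y * truncate(g, z)) - logistic_loss(-y * g)
    for g in [k / 4.0 for k in range(-80, 81)]
    for y in (-1, 1)
)
```

The worst case occurs for confidently correct predictions, where truncation raises the loss from nearly zero to $\phi(-z)$; everywhere else truncation leaves the loss unchanged or lowers it.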

In order to apply Lusin's Theorem and pass to continuous functions with compact support, a few properties must be verified. First, since $\mu_{\mathcal{X}}$ is a Borel probability measure, it is finite on all compact Borel sets. Next, $\mathbb{R}^d$ is a separable metric space, and thus second countable. Finally, $\mathbb{R}^d$ is a locally compact Hausdorff space. It follows that $\mu_{\mathcal{X}}$ is a Radon measure (Folland, 1999, Theorem 7.8).

Henceforth, set τ := ε/(8 max{1, By Lusin's Theorem, there exists a measurable function $h$ which is continuous, has compact support, satisfies $\mu_{\mathcal{X}}([\bar g \ne h]) \le \tau$ and $\|h\|_u \le \|\bar g\|_u$ (Folland, 1999, Theorem 7.10, Lusin's Theorem). But continuity over a compact set implies uniform continuity. Furthermore, the convex function $\phi$, restricted to the domain $[-z, z]$, is necessarily Lipschitz. As such, it is possible to choose $\delta > 0$ so that for any $x, x'$ with $\|x - x'\|_\infty < \delta$ and any $y \in \{-1, +1\}$, it follows that $|\phi(-y h(x)) - \phi(-y h(x'))| < \tau$. Notice that this in fact holds everywhere, since outside of its support $h$ is just zero.

As such, let $T$ be the smallest integer so that $T > \sup\{\|x\|_\infty : h(x) \ne 0\}$ (which exists since $h$ has compact support) and also $1/T < \delta$. For any $t \ge T$, construct a simple function approximation $f$ to $h$ as follows. Partition the cube $[-t, t)^d$ into subcubes of side length $1/t$ with vertices at lattice points; each subcube is a product of half-open intervals, so that the subcubes form a genuine partition. Let $\{C_i\}_{i=1}^k$ index this family of subcubes, and let $p_i$ be some point within each subcube. Define an approximant

$$f(x) := \sum_{i=1}^{k} h(p_i)\,\mathbb{1}(x \in C_i).$$

It follows that, for a point $x \in C_i$ and any $y \in \{-1, +1\}$,

$$|\phi(-y f(x)) - \phi(-y h(x))| = |\phi(-y h(p_i)) - \phi(-y h(x))| < \tau$$

by construction. Since $C_i$ was arbitrary, this holds for every subcube; and it furthermore holds outside the support of $f$, where $h$ and $f$ are both guaranteed to be the constant 0.
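
The lattice construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `h` is a hypothetical continuous function supported on $[-1, 1]^d$ that is 1-Lipschitz in the sup norm, the representative point of each subcube is taken to be its lower corner, and the sketch checks the uniform closeness of $f$ to $h$ (the property that, after composing with a Lipschitz loss, yields the bound above).

```python
import math
import random

def h(x: tuple[float, ...]) -> float:
    """Hypothetical continuous function with support inside [-1, 1]^d."""
    return max(0.0, 1.0 - max(abs(c) for c in x))

def representative(x: tuple[float, ...], t: int):
    """Lower corner of the half-open subcube of side 1/t containing x.

    Returns None when x lies outside the cube [-t, t)^d.
    """
    if any(c < -t or c >= t for c in x):
        return None
    return tuple(-t + math.floor((c + t) * t) / t for c in x)

def f(x: tuple[float, ...], t: int) -> float:
    """Piecewise-constant approximant: h at the representative point, 0 outside the cube."""
    p = representative(x, t)
    return 0.0 if p is None else h(p)

random.seed(0)
t, d = 8, 2
points = [tuple(random.uniform(-2, 2) for _ in range(d)) for _ in range(2000)]
max_err = max(abs(f(x, t) - h(x)) for x in points)
```

Because each point is within sup-norm distance $1/t$ of its representative, the error of this particular `h` is at most $1/t$, so refining the lattice drives the approximation error to zero.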
Combining the various approximation components, it follows that

$$\mathcal{R}_{\phi}(f) = \int \phi(-y f(x))\, d\mu(x, y)$$
$$\le \tau\mu(\mathcal{X} \times \mathcal{Y}) + \int \phi(-y h(x))\, d\mu(x, y)$$
$$\le \epsilon/8 + \int \phi(-y\bar g(x))\,\mathbb{1}(\bar g(x) = h(x))\, d\mu(x, y) + \int \phi(-y h(x))\,\mathbb{1}(\bar g(x) \ne h(x))\, d\mu(x, y)$$
$$\le \epsilon/8 + \mathcal{R}_{\phi}(\bar g) + \mu_{\mathcal{X}}([\bar g \ne h])\,\phi(\|h\|_u)$$

To finish, note by construction that $f$, which was formed from axis-aligned subcubes at lattice points within $[-t, t)^d$, satisfies $f \in \mathrm{span}(\mathcal{H}_t)$ (the indicator for each subcube can be modeled as an element of $\mathcal{H}_t$). □



Proof of Theorem 7.4. Proceed as in the proof of Theorem 7.1, with one modification. First determine $\epsilon_i$. At each stage, choose $j_i$ large enough so that $\mathcal{H}_{j_i}$ satisfies $\mathcal{R}_{\phi}(\mathrm{span}(\mathcal{H}_{j_i})) \le \mathcal{R}_{\phi}(\mathcal{F}) + \epsilon_i$; the existence of such a $j_i$ follows directly from the definition of L-SRM families. Now choose $m_i$ large enough to satisfy the necessary conditions in the proof of Theorem 7.1: meaning the relevant bound from Theorem 6.4 may be instantiated, and furthermore these bounds approach zero as $i \to \infty$. Note that $m_i$ may be quite massive, as it must now overwhelm the term $n = |\mathcal{H}_{j_i}|$. The proof is otherwise identical to before. □